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Abstract 


A  mobile  robot  needs  an  internal  representation  of  its  environment  in  order  to  accomplish  its  mission. 
Building  such  a  representation  involves  transforming  raw  data  from  sensors  into  a  meaningful  geometric 
representation.  In  this  paper,  we  introduce  techniques  for  building  terrain  representations  from  range  data 
for  an  outdoor  mobile  robot.  We  introduce  three  levels  of  representations  that  correspond  to  levels  of 
planning:  obstacle  maps,  terrain  patches,  and  high  resolution  elevation  maps.  Since  terrain  representations 
from  individual  locations  are  not  sufficient  for  many  navigation  tasks,  we  also  introduce  techniques  for 
combining  multiple  maps.  Combining  maps  may  be  achieved  either  by  using  features  or  the  raw  elevation 
data.  Finally,  we  introduce  algorithms  for  combining  3-D  descriptions  with  descriptions  from  other 
sensors,  such  as  color  cameras.  We  examine  the  need  for  this  type  of  sensor  fusion  when  some  semantic 
information  has  to  be  extracted  from  an  observed  scene  and  provide  an  example  application  of  outdoor 
scene  analysis.  Many  of  the  techniques  presented  in  this  paper  have  been  tested  in  the  field  on  three 
mobile  robot  systems  developed  at  CMU. 
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1  Introduction 


A  mobile  robot  is  a  vehicle  that  navigates  autonomously  through  an  unknown  or  partially  known  environ¬ 
ment.  Research  in  the  field  of  mobile  robots  has  received  considerable  attention  in  the  past  decade  due 
to  its  wide  range  of  potential  applications,  from  surveillance  to  planetary  exploration,  and  the  research 
opportunities  it  provides,  including  virtually  the  whole  spectrum  of  robotics  research  from  vehicle  control 
to  symbolic  planning  (see  for  example  [18]  for  an  analysis  of  the  research  issues  in  mobile  robots).  In 
this  paper  we  present  our  investigation  of  some  the  issues  in  one  of  the  components  of  mobile  robots: 
perception.  The  role  of  perception  in  mobile  robots  is  to  transform  data  from  sensors  into  representations 
that  can  be  used  by  the  decision-making  components  of  the  system.  The  simplest  example  is  the  detection 
of  potentially  dangerous  regions  in  the  environment  (i.e.  obstacles)  that  can  be  used  by  a  path  planner 
whose  role  is  to  generate  safe  trajectories  for  the  vehicle.  An  example  of  a  more  complex  situation  is 
a  mission  that  requires  the  recognition  of  specific  landmarks,  in  which  case  the  perception  components 
must  produce  complex  descriptions  of  the  sensed  environment  and  relate  them  to  stored  models  of  the 
landmarks. 

There  are  many  sensing  strategies  for  perception  for  mobile  robots,  including  single  camera  systems, 
sonars,  passive  stereo,  and  laser  range  finders.  In  this  report,  we  focus  on  perception  algorithms  for 
range  sensors  that  provide  3-D  data  directly  by  active  sensing.  Using  such  sensors  has  the  advantage 
of  eliminating  the  calibration  problems  and  computational  costs  inherent  in  passive  techniques  such  as 
stereo.  We  describe  the  range  sensor  that  we  used  in  this  work  in  Section  2.  Even  though  we  tested  our 
algorithm  on  one  specific  range  sensor,  we  believe  that  the  sensor  characteristics  of  Section  2  are  fairly 
typical  of  a  wide  range  of  sensors  [4], 

Rcsw-:;h  ir.  perception  for  mobile  robots  is  not  only  sensor-dependent  but  it  is  also  dependent  on 
the  environment.  A  considerable  part  of  the  global  research  effort  has  concentrated  on  the  problem 
of  perception  for  mobile  robot  navigation  in  indoor  environments,  and  our  work  in  natural  outdoor 
environments  through  the  Autonomous  Land  Vehicle  and  Planetary  Exploration  projects  is  an  important 
development.  This  report  describes  some  of  the  techniques  we  have  developed  in  this  area  of  research. 
I  he  aim  of  ou»  work  is  to  produce  models  of  the  environment,  which  we  call  the  terrain,  for  path  planning 
and  object  recognition. 

The  algorithms  for  building  a  terrain  representation  from  a  single  sensor  frame  are  discussed  in 
Section  3  in  which  we  introduce  the  concept  of  dividing  the  terrain  representation  algorithms  into  three 
levels  depending  on  the  sophistication  of  the  path  planner  that  would  use  the  representation,  and  on  the 
anticipated  difficulty  of  the  terrain.  Since  a  mobile  robot  is  by  definition  a  dynamic  system,  it  must  process 
not  one,  but  many  observations  along  the  course  of  its  trajectory.  The  3-D  vision  algorithms  must  therefore 
be  able  to  reason  about  representations  that  are  built  from  sensory  data  taken  from  different  locations.  We 
investigate  this  type  of  algorithms  in  Section  4  in  which  we  propose  algorithms  for  matching  and  merging 
multiple  terrain  representations.  Finally,  the  3-D  vision  algorithms  that  we  propose  are  not  meant  to  be 
used  in  isolation,  they  have  to  be  eventually  integrated  in  a  system  that  include  other  sensors.  A  typical 
example  is  the  case  of  road  following  in  which  color  cameras  can  track  the  road,  while  a  range  sensor 
can  detect  unexpected  obstacles.  Another  example  is  a  mission  in  which  a  scene  must  be  interpreted 
in  order  to  identify  specific  objects,  in  which  case  all  the  available  sensors  must  contribute  to  the  final 
scene  analysis.  We  propose  some  algorithms  for  fusing  3-D  representations  with  representations  obtained 
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from  a  color  camera  in  Section  5.  We  also  describe  the  application  of  this  sensor  fusion  to  a  simple 
natural  scene  analysis  program.  Perception  techniques  for  mobile  robots  have  to  be  eventually  validated 
by  using  real  robots  in  real  environments.  We  have  implemented  the  3-D  vision  techniques  presented  in 
this  report  on  three  mobile  robots  developed  by  the  Field  Robotics  Center:  the  Tcrregator,  the  Navlab,  and 
the  Ambler.  The  Terregator  (Figure  1)  is  a  six-whccicd  vehicle  designed  for  rugged  terrain.  It  docs  not 
have  any  onboard  computing  units  except  for  the  low-level  control  of  the  actuators.  All  the  processing 
was  done  on  Sun  workstations  through  a  radio  connection.  We  used  this  machine  in  early  experiments 
with  range  data,  most  notably  the  sensor  fusion  experiments  of  Section  5.  The  Navlab  [36]  (Figure  2)  is  a 
converted  Chevy  van  designed  for  navigation  on  roads  or  on  mild  terrains.  The  Navlab  is  a  self-contained 
robot  in  that  all  the  computing  equipment  is  on  board.  The  results  presented  in  Sections  3.3  and  3.4 
come  from  the  3-D  vision  module  that  we  integrated  in  the  Navlab  system  [42],  The  Ambler  [2]  is  an 
hexapod  designed  for  the  exploration  of  Mars  (Figure  3).  This  vehicle  is  designed  for  navigation  on  very 
rugged  terrain  including  high  slopes,  rocks,  ar.d  wide  gullies.  This  entirely  new  design  prompted  us  to 
investigate  alternative  3-D  vision  algorithms  that  arc  reported  in  Section  3.5.  Even  though  the  hardware 
for  the  Ambler  docs  not  exist  at  this  time,  we  have  evaluated  the  algorithms  through  simulation  and 
careful  analysis  of  the  planetary  exploration  missions. 


Figure  1 :  The  Terregator 
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2  Active  range  and  reflectance  sensing  • 

The  basic  principle  of  active  sensing  techniques  is  to  observe  the  reflection  of  a  reference  signal  (sonar, 
laser,  radar.. etc.)  produced  by  an  object  in  the  environment  in  order  to  compute  the  distance  between  the 
sensor  and  that  object.  In  addition  to  the  distance,  the  sensor  may  report  the  intensity  of  the  reflected 


Figure  3:  The  Ambler 


signal  which  is  related  to  physical  surface  properties  of  the  object.  In  accordance  with  tradition,  wc  will 
refer  to  this  type  of  intensity  data  as  "reflectance"  data  even  though  the  quantity  measured  is  not  the  actual 
reflectance  coefficient  of  the  surface. 

Active  sensors  are  attractive  to  mobile  robots  researchers  for  two  main  reasons:  first,  they  provide 
range  data  without  the  computation  overhead  associated  with  conventional  passive  techniques  such  as 
stereo  vision,  which  is  important  in  time  critical  applications  such  as  obstacle  detection.  Second,  it  is 
largely  insensitive  to  outside  illumination  conditions,  simplifying  considerably  the  image  analysis  problem. 
This  is  especially  important  for  images  of  outdoor  scenes  in  which  illumination  cannot  be  controlled  or 
predicted.  For  example,  the  active  reflectance  images  of  outside  scenes  do  not  contain  any  shadows  from 
the  sun.  In  addition,  active  range  finding  technology  has  developed  to  the  extent  that  makes  it  realistic  to 
consider  it  as  part  of  practical  mobile  robot  implementations  in  the  short  term  [4J. 

The  range  sensor  we  used  is  a  time-of-flight  laser  range  finder  developed  by  the  Environmental 
Research  Institute  of  Michigan  (ERIM).  The  basic  principle  of  the  sensor  is  to  measure  the  difference  of 
phase  between  a  laser  beam  and  its  reflection  from  the  scene  [46].  A  two-mirror  scanning  system  allows 
the  beam  to  be  directed  anywhere  within  a  30°  x  80°  field  of  view.  The  data  produced  by  the  ERIM  sensor 
is  a  64  x  256  range  image,  the  range  is  coded  on  eight  bits  from  zero  to  64  feet,  which  corresponds  to  a 
range  resolution  of  three  inches.  All  measurements  are  all  relative  since  the  sensor  measures  differences 
of  phase.  That  is,  a  range  value  is  known  modulo  64  feet.  We  have  adjusted  the  sensor  so  that  the  range 
value  0  corresponds  to  the  mirrors  for  all  the  images  presented  in  this  report.  In  addition  to  range  images, 
the  sensor  also  produces  active  reflectance  images  of  the  same  format  (64  x  256  x  8  bits),  the  reflectance 
at  each  pixel  encodes  the  energy  of  the  reflected  laser  beam  at  each  point.  Figure  5  shows  a  pair  of  range 
and  reflectance  images  of  an  outdoor  scene.  The  next  two  Sections  describe  the  range  and  reflectance 
data  in  more  details. 


2.1  From  range  pixels  to  points  in  space 

The  position  of  a  point  in  a  given  coordinate  system  can  be  derived  from  the  measured  range  and  the 
direction  of  the  beam  at  that  point.  Wc  usually  use  the  Cartesian  coordinate  system  shown  in  Figure  4. 
in  which  case  the  coordinates  of  a  point  measured  by  the  range  sensor  arc  given  by  the  equations1: 

x  =  DsinO  (1) 

y  =  D  cos  cos  0 
z  -  D  sin  o  cos  0 

where  o  and  6  arc  the  vertical  and  horizontal  scanning  angles  of  the  beam  direction.  The  two  angles 
arc  derived  from  the  row  and  column  position  in  the  range  image,  (r.c),  by  the  equations: 


0  =  0O  +  c  x  AH 
O  =  </>()  +  r  x  _Ao 


Note  that  the  reference  coordinate  system  is  not  the  same  as  in  [20|  for  consistency  reasons 


where  do  (respectively  <Po )  is  the  starting  horizontal  (respectively  vertical;  scanning  angle,  and  A9  (re¬ 
spectively  A4>)  is  the  angular  step  between  two  consecutive  columns  (respectively  rows).  Figure  6  shows 
an  overhead  view  of  the  scene  of  Figure  5,  the  coordinates  of  the  points  are  computed  using  Eq.  (3). 


Figure  4:  Geometry  of  the  range  sensor 


Figure  5:  Range  and  reflectance  images 


2.2  Reflectance  images 

A  reflectance  image  from  the  ERIM  sensor  is  an  image  of  the  energy  reflected  by  a  laser  beam.  Unlike 
conventional  intensity  images,  this  data  provides  us  with  information  which  is  to  a  large  extent  independent 


I 
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Figure  6:  Overhead  view 


of  the  environmental  illumination.  In  particular,  the  reflectance  images  contain  no  shadows  from  outside 
illumination.  The  measured  energy  does  depend,  however,  on  the  shape  of  the  surface  and  its  distance  to 
the  sensor.  We  correct  the  image  so  that  the  pixel  values  are  functions  only  of  the  material  reflectance. 
The  measured  energy,  Preturn ,  depends  on  the  specific  material  reflectance,  p,  the  range,  D,  and  the  angle 
of  incidence,  7: 


P 


return 


Kp COS  * 

_____ 


(3) 


Due  to  the  wide  range  of  Preturn ,  the  value  actually  reported  in  ihe  reflectance  image  is  co  1  pressed 
by  using  a  log  transform.  That  is,  the  digitized  value,  P image  is  of  the  form  [44]: 

P unage  =  A  l0g(p  COS  j)  +  B\ogD  (4) 

where  A  and  B  are  constants  that  depend  only  on  the  characteristics  of  the  laser,  the  circuitry  used  for  the 
digitization,  and  the  physical  properties  of  the  ambiant  atmosphere.  Since  A  and  B  cannot  be  compued 
directly,  we  use  a  calibration  procedure  in  which  a  homogeneous  flat  region  is  selected  in  a  training  image; 
we  then  use  the  pixels  in  this  region  to  estimate  A  and  B  by  least-squares  fitting  Eq.  14')  to  the  actual 
reflectance/range  data.  Given  A  and  B,  we  correct  subsequent  images  by: 


P new— image  “  (Punage-  B\OgD)/A  (5) 

The  value  P new- image  depends  only  on  the  material  reflectance  and  the  angle  of  incidence.  This  is  a 
sufficient  approximation  for  our  purposes  since  for  smooth  surfaces  such  as  smooth  terrain,  the  cos', 
factor  does  not  vary  widely.  For  efficiency  purposes,  the  right-hand  side  of  (5)  is  precomputed  for  all 
possible  combinations  ( P image- D )  and  stored  in  a  lookup  table.  Figure  5  shows  an  example  of  an  ERIM 
image,  and  Figure  7  shows  the  resulting  corrected  image. 
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Figure  7:  Corrected  reflectance  image 


2.3  Resolution  and  noise 

As  is  the  case  with  any  sensor,  the  range  sensor  returns  values  that  are  measured  with  a  limited  resolution 
which  are  corrupted  by  measurement  noise.  In  the  case  of  the  ERIM  sensor,  the  main  source  of  noise 
is  due  to  the  fact  that  the  laser  beam  is  not  a  line  in  space  but  rather  a  cone  whose  opening  is  a  0.5° 
solid  angle  (the  instantaneous  field  of  view).  The  value  returned  at  each  pixel  is  actually  the  average  of 
the  range  of  values  over  a  2-D  area,  the  footprint,  which  is  the  intersection  of  the  cone  with  the  target 
surface  (Figure  8).  Simple  geometry  shows  that  the  area  of  the  footprint  is  proportional  to  the  square  of 
the  range  at  its  center.  The  size  of  the  footprint  also  depends  on  the  angle  6  between  tlv  surface  normal 
and  the  beam  as  shown  in  Figure  8.  The  size  of  the  footprint  is  roughly  inversely  proportional  to  cos  9 
if  we  assume  that  the  footprint  is  small  enough  and  that  6  is  almost  constant.  Therefore,  a  first  order 
approximation  of  the  standard  deviation  of  the  range  noise,  cr  is  given  by: 


er  oc 


P2 
cos  6 


(6) 


The  proportionality  factor  in  this  equation  depends  on  the  characteristics  of  the  laser  transmitter,  the 
outside  illumination,  and  the  reflectance  p  of  the  surface  which  is  assumed  constant  across  the  footprint 
in  this  first  order  approximation.  We  validated  the  model  of  Equation  6  by  estimating  the  RMS  error 
of  the  range  values  on  a  sequence  of  images.  Figure  9  shows  the  standard  deviation  with  respect  to  the 
measured  range.  The  Figure  shows  that  o  follows  roughly  the  D 2  behavior  predicted  by  the  first  order 
model.  The  footprint  affects  all  pixels  in  the  image. 

There  are  other  effects  that  produce  distortions  only  at  specific  locations  in  the  image.  The  main  effect 
is  known  as  the  "mixed  point"  problem  and  is  illustrated  in  Figure  8  in  which  the  laser  footprint  crosses 
the  edge  between  two  objects  that  are  far  from  each  other.  In  that  case,  the  returned  range  value  is  some 
combination  of  the  range  of  the  two  objects  but  does  not  have  any  physical  meaning.  This  problem  makes 
the  accurate  detection  of  occluding  edges  more  difficult.  Another  effect  is  due  to  the  reflectance  properties 
of  the  observed  surface;  if  the  surface  is  highly  specular  then  no  laser  reflection  can  be  observed.  In  that 
case  the  ERIM  sensor  returns  a  value  of  255.  This  effect  is  most  noticeable  on  man-made  objects  that 
contain  a  lot  of  polished  metallic  surfaces.  It  should  be  mentioned,  however,  that  the  noise  characteristics 
of  the  ERIM  sensor  are  fairly  typical  of  the  behavior  of  active  range  sensors  [51. 


7 


1H  <H 


ItM  t»l  MH  1MI  1IM 


Figure  9:  Noise  in  range  data 
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3  Terrain  representations 

The  main  task  of  3-D  vision  in  a  mobile  robot  system  is  to  provide  sufficient  information  to  the  path 
planner  so  that  the  vehicle  can  be  safely  steered  through  its  environment.  In  the  case  of  outdoor  navigation, 
the  task  is  to  convert  a  range  image  into  a  representation  of  the  terrain.  We  use  the  word  "terrain"  in 
a  very  loose  sense  in  that  we  mean  both  the  ground  surface  and  the  objects  that  may  appear  in  natural 
environments  (e.g.  rocks  or  trees).  In  this  Section  we  discuss  the  techniques  that  we  have  implemented 
for  the  Navlab  and  Mars  Rover  systems.  We  first  introduce  the  concept  of  the  elevation  map  as  a  basis  for 
terrain  representations  and  its  relationship  with  different  path  planning  techniques,  rhe  last  four  Sections 
spell  out  the  technical  details  of  the  terrain  representation  algorithms. 

3.1  The  elevation  map  as  the  data  structure  for  terrain  representation 

Even  though  the  format  of  the  range  data  is  an  image,  this  may  not  be  the  most  suitable  structuring  of  the 
data  for  extracting  information.  For  example  ,  a  standard  representation  in  3-D  vision  for  manipulation 
is  to  view  a  range  image  as  a  set  of  data  points  measured  on  a  surface  of  the  equation  z  =/(x,y)  where 
the  x-  and  y- axes  are  parallel  to  the  axis  of  the  image  and  z  is  the  measured  depth.  This  choice  of  axis 
is  natural  since  the  image  plane  is  usually  parallel  to  the  plane  of  the  scene.  In  our  case,  however,  the 
"natural”  reference  plane  is  not  the  image  plane  but  is  the  ground  plane.  In  this  context,  "ground  plane” 
refers  to  a  plane  that  is  horizontal  with  respect  to  the  vehicle  or  to  the  gravity  vector.  The  representation 
z  =  /(*,  y)  is  then  the  usual  concept  of  an  elevation  map.  To  transform  the  data  points  into  an  elevation 
map  is  useful  only  if  one  has  a  way  to  access  them.  The  most  common  approach  is  to  discretize  the  (x.y) 
plane  into  a  grid.  Each  grid  cell  (Xi,yd  is  the  trace  of  a  vertical  column  in  space,  its  field  (Figure  10).  All 
the  data  that  falls  within  a  cell’s  field  is  stored  in  that  cell.  The  description  shown  in  Figure  10  does  not 
necessarily  reflect  the  actual  implementation  of  an  elevation  map  but  is  more  of  a  framework  in  which  we 
develop  the  terrain  representation  algorithms.  As  we  shall  see  later,  the  actual  implementation  depends 
on  the  level  of  detail  that  needs  to  be  included  in  the  terrain  description. 

Although  the  elevation  map  is  a  natural  concept  for  terrain  representations,  it  exhibits  a  number  of 
problems  due  to  the  conversion  of  a  regularly  sampled  image  to  a  different  reference  plane  [25].  Although 
we  propose  solutions  to  these  problems  in  Section  3.5,  it  is  important  to  keep  them  in  mind  while  we 
investigate  other  terrain  representations.  The  first  problem  is  the  sampling  problem  illustrated  in  Figure  1 1 . 
Since  we  perform  some  kind  of  image  warping,  the  distribution  of  data  points  in  the  elevation  map  is 
not  uniform,  and  as  a  result  conventional  image  processing  algorithms  cannot  be  applied  directly  to  the 
map.  There  are  two  ways  to  get  around  the  sampling  problem:  We  can  either  use  a  base  structure 
that  is  not  a  regularly  spaced  grid,  such  as  a  Delaunay  triangulation  of  the  data  points  [33],  or  we  can 
interpolate  between  data  points  to  build  a  dense  elevation  map.  The  former  solution  is  not  very  practical 
because  of  the  complex  algorithms  required  to  access  data  points  and  their  neighborhoods.  We  describe 
an  implementation  of  the  latter  approach  in  Section  3.5.  A  second  problem  with  elevation  maps  is  the 
representation  of  the  range  shadows  created  by  some  objects  (Figure  12).  Since  no  information  is  available 
within  the  shadowed  regions  of  the  map,  we  must  represent  them  separately  so  that  no  interpolation  takes 
place  across  them  and  no  "phantom"  features  are  reported  to  the  path  planner.  Finally,  we  have  to  convert 
the  noise  on  the  original  measurements  into  a  measure  of  uncertainty  on  the  z  value  at  each  grid  point 
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Figure  10:  Structure  of  an  elevation  map 


( x ,  y).  This  conversion  is  difficult  due  to  the  fact  that  the  sensor’s  uncertainty  is  most  naturally  represented 
with  respect  to  the  direction  of  measurement  (Figure  13)  and  therefore  spreads  across  a  whole  region  in 
the  elevation  map. 

Sensor 


3.2  Terrain  representations  and  path  planners 

The  choice  of  a  terrain  representation  depends  on  the  path  planner  used  for  actually  driving  the  vehicle. 
For  example,  the  family  of  planners  derived  from  the  Lozano- Perez’s  A'  approach  [28]  uses  discrete 
obstacles  represented  by  2-D  polygons.  By  contrast,  planners  that  compare  a  vehicle  model  with  the  local 
terrain  [9,38]  use  some  intermediate  representation  of  the  raw  elevation  map.  Furthermore,  the  choice  of 
a  terrain  representation  and  a  path  planner  in  turn  depend  on  the  environment  in  which  the  vehicle  has  to 
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Figure  12:  An  example  of  a  range  shadow 
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navigate.  For  example,  representing  only  a  small  number  of  discrete  upright  objects  may  be  appropriate 
if  it  is  known  in  advance  that  the  terrain  is  mostly  flat,  (e.g.  a  road)  with  a  few  obstacles  (e.g.  trees) 
while  cross  country  navigation  requires  a  more  detailed  description  of  the  elevation  map.  Generating  the 
most  detailed  description  and  then  extracting  the  relevant  information  is  not  an  acceptable  solution  since 
it  would  significantly  degrade  the  performance  of  the  system  in  simple  environments.  Therefore,  we 
need  several  levels  of  terrain  representation  corresponding  to  different  resolutions  at  which  the  terrain  is 
described  (Figure  14).  At  the  low  resolution  level  we  describe  only  discrete  obstacles  without  explicitly 
describing  the  local  shape  of  the  terrain.  At  the  medium  level,  we  include  a  description  of  the  terrain 
through  surface  patches  that  correspond  to  significant  terrain  features.  At  that  level,  the  resolution  is  the 
resolution  of  the  operator  used  to  detect  these  features.  Finally,  the  description  with  the  highest  resolution 
is  a  dense  elevation  map  whose  resolution  is  limited  only  by  the  sensor.  In  order  to  keep  the  computations 
involved  under  control,  the  resolution  is  typically  related  to  the  size  of  the  vehicle’s  parts  that  enter  in 
contact  with  the  terrain.  For  example,  the  size  of  one  foot  is  used  to  compute  the  terrain  resolution  in  the 
case  of  a  legged  vehicle. 


Figure  14:  Levels  of  terrain  representation 


3.3  Low  resolution:  Obstacle  map 

The  lowest  resolution  terrain  representation  is  an  obstacle  map  which  contains  a  small  number  of  obstacles 
represented  by  their  trace  on  the  ground  plane.  Several  techniques  have  been  proposed  for  obstacle 
detection.  The  Martin-Marietta  ALV  [10,11,431  detects  obstacles  by  computing  the  difference  between 


12 


the  observed  range  image  and  pre-computed  images  of  ideal  ground  at  several  different  slope  angles. 
Points  that  are  far  from  the  ideal  ground  planes  are  grouped  into  regions  that  are  reported  as  obstacles 
to  a  path  planner.  A  very  fast  implementation  of  this  technique  is  possible  since  it  requires  only  image 
differences  and  region  grouping.  It  makes,  however,  very  strong  assumptions  on  the  shape  of  the  terrain. 
It  also  takes  into  account  only  the  positions  of  the  potential  obstacle  point,  and  as  a  result  a  very  high 
slope  ridge  that  is  not  deep  enough  would  not  be  detected. 

Another  approach  proposed  by  Hughes  AI  group  [8]  is  to  cjtect  the  obstacles  by  thresholding  the 
normalized  range  gradient,  AD/D,  and  by  thresholding  the  radial  slope,  DAcp/  AD.  The  first  test  detects 
the  discontinuities  in  range,  while  the  second  test  detects  the  portion  of  the  terrain  with  high  slope.  This 
approach  has  the  advantage  of  taking  a  vehicle  model  into  account  when  deciding  whether  a  point  is  part 
of  an  obstacle.  We  used  the  terrain  map  paradigm  to  detect  obstacles  for  the  Navlab.  Each  cell  of  the 
terrain  contains  the  set  of  data  points  that  fall  within  its  field  (Figure  10).  We  can  then  estimate  surface 
normal  and  curvatures  at  each  elevation  map  cell  by  fitting  a  reference  surface  to  the  corresponding  set 
of  data  points.  Cells  that  have  a  high  curvature  or  a  surface  normal  far  from  the  vehicle’s  idea  of  the 
vertical  direction  are  reported  as  part  of  the  projection  of  an  obstacle.  Obstacle  cells  are  then  grouped 
into  regions  corresponding  to  individual  obstacles.  The  final  product  of  the  obstacle  detection  algorithm 
is  a  set  of  2-D  polygonal  approximations  of  the  boundaries  of  the  detected  obstacles  that  is  sent  to  an 
A’ -type  path  planner  (Figure  15).  In  addition,  we  can  roughly  classify  the  obstacles  into  holes  or  bumps 
according  to  the  shape  of  the  surfaces  inside  the  polygons. 

Figure  16  shows  the  result  of  applying  the  obstacle  detection  algorithm  to  a  sequence  of  ERIM  images. 
The  Figure  shows  the  original  range  images  (top),  the  range  pixels  projected  in  the  elevation  map  (left), 
and  the  resulting  polygonal  obstacle  map  (right).  The  large  enclosing  polygon  in  the  obstacle  map  is  the 
limit  of  the  visible  portion  of  the  world.  The  obstacle  detection  algorithm  does  not  make  assumptions  on 
the  position  of  the  ground  plane  in  that  it  only  assumes  that  the  plane  is  roughly  horizontal  with  respect  to 
the  vehicle.  Computing  the  slopes  within  each  cell  has  a  smoothing  effect  that  may  cause  real  obstacles 
to  be  undetected.  Therefore,  the  resolution  of  the  elevation  map  must  be  chosen  so  that  each  cell  is 
significantly  larger  than  the  typical  expected  obstacles.  In  the  case  of  Figure  16,  the  resolution  is  twenty 
centimeters.  The  size  of  the  detectable  obstacle  also  varies  with  the  distance  from  the  vehicle  due  to  the 
sampling  problem  (Section  3.1). 

One  major  drawback  of  our  obstacle  detection  algorithm  is  that  the  computation  of  the  slopes  and 
curvatures  at  each  cell  of  the  elevation  map  is  an  expensive  operation.  Furthermore,  since  low-resolution 
obstacle  maps  are  most  useful  for  fast  navigation  through  simple  environments,  it  is  important  to  have  a  fast 
implementation  of  the  obstacle  detection  algorithm.  A  natural  optimization  is  to  parallelize  the  algorithm 
by  dividing  the  elevation  map  into  blocks  that  are  processed  simultaneously.  We  have  implemented  such  a 
parallel  version  of  the  algorithm  on  a  ten-processor  Warp  computer  [45,21],  The  parallel  implementation 
reduced  the  cycle  time  to  under  two  seconds,  thus  making  it  possible  to  use  the  obstacle  detection 
algorithm  for  fast  navigation  of  the  Navlab.  In  that  particular  implementation,  the  vehicle  was  moving 
at  a  continuous  speed  of  one  meter  per  second,  taking  range  images,  detecting  obstacles,  and  planning  a 
path  every  four  meters. 
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Range  image 


Figure  15:  Building  the  obstacle  map 


3.4  Medium  resolution:  Polygonal  terrain  map 

Obstacle  detection  is  sufficient  for  navigation  in  flat  terrain  with  discrete  obstacles,  such  as  following  a 
road  bordered  by  trees.  We  need  a  more  detailed  description  when  the  terrain  is  uneven  as  in  the  case 
of  cross-country  navigation.  For  that  purpose,  an  elevation  map  could  be  used  directly  [9]  by  a  path 
planner.  This  approach  is  costly  because  of  the  amount  of  data  to  be  handled  by  the  planner  which  does 
not  need  such  a  high  resolution  description  to  do  the  job  in  many  cases  (although  we  will  investigate 
some  applications  in  which  a  high  resolution  representation  is  required  in  Section  3.5).  An  alternative  is 
to  group  smooth  portions  of  the  terrain  into  regions  and  edges  that  are  the  basic  units  manipulated  by 
the  planner.  This  set  of  features  provides  a  compact  representation  of  the  terrain  thus  allowing  for  more 
efficient  planning  [38], 

The  features  used  are  of  two  types:  smooth  regions,  and  sharp  terrain  discontinuities.  The  terrain 
discontinuities  are  either  discontinuities  of  the  elevation  of  the  terrain,  as  in  the  case  of  a  hole,  or 
discontinuities  of  the  surface  normals,  as  in  the  case  of  the  shoulder  of  a  road  [3],  We  detect  both  types 
of  discontinuities  by  using  an  edge  detector  over  the  elevation  map  and  the  surface  normals  map.  The 
edges  correspond  to  small  regions  on  the  terrain  surface.  Once  we  have  detected  the  discontinuities,  we 
segment  the  terrain  into  smooth  regions.  The  segmentation  uses  a  region  growing  algorithm  that  first 
identifies  the  smoothest  locations  in  the  terrain  based  on  the  behavior  of  the  surface  normals,  and  then 
grows  regions  around  those  locations.  The  result  of  the  processing  is  a  covering  of  the  terrain  by  regions 
corresponding  either  to  smooth  portions  or  to  edges. 

The  final  representation  depends  on  the  planner  that  uses  it.  In  our  case,  the  terrain  representation  is 
embedded  in  the  Navlab  system  using  the  path  planner  described  in  [39j.  The  basic  geometric  object  used 
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by  the  system  is  the  three-dimensional  polygon.  We  therefore  approximate  the  boundary  of  each  region 
by  a  polygon.  The  approximation  is  done  in  a  way  that  ensures  consistency  between  regions  in  that  the 
polygonal  boundaries  of  neighboring  regions  share  common  edges  and  vertices.  This  guarantees  that  no 
"gaps"  exist  in  the  resulting  polygonal  mesh.  This  is  important  from  the  point  of  view  of  the  path  planner 
since  such  gaps  would  be  interpreted  as  unknown  portions  of  the  terrain.  Each  region  is  approximated 
by  a  planar  surface  that  is  used  by  the  planner  to  determine  the  traversability  of  the  regions.  Since  the 
regions  are  not  planar  in  reality,  the  standard  deviation  of  the  parameters  of  the  plane  is  associated  with 
each  region. 

Figure  18  shows  the  polygonal  boundaries  of  the  regions  extracted  from  the  image  of  Figure  17.  In 
this  implementation,  the  resolution  of  the  elevation  map  is  twenty  centimeters.  Since  we  need  a  dense 
map  in  order  to  extract  edges,  we  interpolated  linearly  between  the  sparse  points  of  the  elevation  map. 
Figure  17  shows  the  interpolated  elevation  map.  This  implementation  of  a  medium  resolution  terrain 
representation  is  integrated  in  the  Navlab  system  and  will  be  part  of  the  standard  core  system  for  our 
future  mobile  robot  systems. 


Figure  17:  Range  image  and  elevation  map 


3.5  High  resolution:  Elevation  maps  for  rough  terrain 

The  elevation  map  derived  directly  from  the  sensor  is  sparse  and  noisy,  especially  at  greater  distances 
from  the  sensor.  Many  applications,  however,  need  a  dense  and  accurate  high  resolution  map.  One  way 
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Figure  18:  Polygonal  boundaries  of  terrain  regions 


to  derive  such  a  map  is  to  interpolate  between  the  data  points  using  some  mathematical  approximation 
of  the  surface  between  data  points.  The  models  that  can  be  used  include  linear,  quadratic,  or  bicubic 
surfaces  [33].  Another  approach  is  to  fit  a  surface  globally  under  some  smoothness  assumptions.  This 
approach  includes  the  family  of  regularization  algorithms  [6]  in  which  a  criterion  of  the  form: 


^interpolation  1 1  ^  f  f  (^interpolation) 


(7) 


is  minimized,  where  f  is  a  regularization  function  that  reflects  the  smoothness  model  (eg.  thin  plate). 
Two  problems  arise  with  both  interpolation  approaches:  They  make  apriori  assumptions  on  the  local 
shape  of  the  terrain  which  may  not  be  valid  (e.g.  in  the  case  of  very  rough  terrain),  and  they  do  not  take 
into  account  the  image  formation  process  since  they  are  generic  techniques  independent  of  the  origin  of 
the  data.  In  addition,  the  interpolation  approaches  depend  heavily  on  the  resolution  and  position  of  the 
reference  grid.  For  example,  they  cannot  compute  an  estimate  of  the  elevation  at  an  Ct.y)  position  that  is 
not  a  grid  point  without  resampling  the  grid.  We  propose  an  alternative,  the  locus  algorithm  [25],  that  uses 
a  model  of  the  sensor  and  provides  interpolation  at  arbitrary  resolution  without  making  any  assumptions 
on  the  terrain  shape  other  than  the  continuity  of  the  surface. 


3.5.1  The  locus  algorithm  for  the  optimal  interpolation  of  terrain  maps 

The  problem  of  finding  the  elevation  z  of  a  point  (x,y)  is  trivially  equivalent  to  computing  the  intersection 
of  the  surface  observed  by  the  sensor  and  the  vertical  line  passing  through  (x.y).  The  basic  idea  of  the 
locus  algorithm  is  to  convert  the  latter  formulation  into  a  problem  in  image  space  (Figure  19).  A  vertical 
line  is  a  curve  in  image  space,  the  locus,  whose  equation  as  a  function  of  o  is: 

D  =  Dt(<t>)=J~?- t-+x2  (8) 

V  COSZ0 
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where  <p,  0,  and  D  are  defined  as  in  Section  2.  Equation  (9)  was  derived  by  inverting  Equation  (2),  and 
assuming  x  and  y  constant.  Similarly,  the  range  image  can  be  viewed  as  a  surface  D  =  I(o,8)  in  o,0, 
D  space.  The  problem  is  then  to  find  the  intersection,  if  it  exists,  between  a  curve  parametrized  by  o 
and  a  discrete  surface.  Since  the  surface  is  known  only  from  a  sample  of  data,  the  intersection  cannot 
be  computed  analytically.  Instead,  we  have  to  search  along  the  curve  for  the  intersection  point.  The 
search  proceeds  in  two  stages:  We  first  locate  the  two  scanlines  of  the  range  image,  <p i  and  Co>  between 
which  the  intersection  must  be  located,  that  is  the  two  consecutive  scanlines  such  that,  Diff(o i)  = 
Di(d>\)  -  and  Difff^-i)  -  l(4>2,8i(,<t>i))  have  opposite  signs,  where  (?;(<?)  is  the  image 

column  that  is  the  closest  to  #/(<£).  We  then  apply  a  binary  search  between  d>\  and  <?2-  The  search  stops 
when  the  difference  between  the  two  angles  <pn  and  <pn+\,  where  Diff(<p„)  and  Diff{<?n+\)  have  opposite 
signs,  is  lower  than  a  threshold  e.  Since  there  are  no  pixels  between  $ \  and  02,  we  have  to  perform  a 
local  quadratic  interpolation  of  the  image  in  order  to  compute  #/(<>)  and  Di(o)  for  6\  <  o  <  02.  The 
control  points  for  the  interpolation  are  the  four  pixels  that  surround  the  intersection  point  (Figure  20).  The 
final  result  is  a  value  <p  that  is  converted  to  an  elevation  value  by  applying  Equation  (2)  to  o.  61(0).  Di(o). 
The  resolution  of  the  elevation  is  controlled  by  the  choice  of  the  parameter  e. 

The  locus  algorithm  enables  us  to  evaluate  the  elevation  at  any  point  since  we  do  not  assume  the 
existence  of  a  grid.  Figure  21  shows  the  result  of  applying  the  locus  algorithm  on  range  images  of  uneven 
terrain,  in  this  case  a  construction  site.  The  Figure  shows  the  original  range  images  and  the  map  displayed 
as  an  isoplot  surface.  The  centers  of  the  grid  cells  are  ten  centimeters  apart  in  the  (x.  y)  plane. 


3.5.2  Generalizing  the  locus  algorithm 


We  can  generalize  the  locus  algorithm  from  the  case  of  a  vertical  line  to  the  case  of  a  general  line  in 
space.  This  generalization  allows  us  to  build  maps  using  any  reference  plane  instead  of  being  restricted 
to  the  (x,y)  plane.  This  is  important  when,  for  example,  the  sensor’s  (x,  y)  plane  is  not  orthogonal  to  the 
gravity  vector.  A  line  in  space  is  defined  by  a  point  u  =  [«*,  Uy,uz]‘,  and  a  unit  vector  v  =  [vx.  vy.  v2]'. 
Such  a  line  is  parametrized  in  A  by  the  relation  p  =  u+  Av  if  p  is  a  point  on  the  line.  A  general  line  is 
still  a  curve  in  image  space  that  can  be  parametrized  in  o  if  we  assume  that  the  line  is  not  parallel  to  the 
(x,  y)  plane.  The  equation  of  the  curve  becomes: 


D,(0) 

0i(q) 

Mo) 


\J  (v*A(<?)  +  ux )2  +  ( Vy\(<p )  +  Uy)2  +  (vzA(c>)  +  uz)2 
VxM<P)  +  Ux 


arcsin 


D 


Uy  tan  o  uz 
v*  -  vy  tan  o 


(10) 


We  can  then  compute  the  intersection  between  the  curve  and  the  image  surface  by  using  the  same  algorithm 
as  before  except  that  we  have  to  use  Equation  (10)  instead  of  Equation  (9). 

The  representation  of  the  line  by  the  pair  (u.  v)  is  not  optimal  since  it  uses  six  parameters  while  only 
four  parameters  are  needed  to  represent  a  line  in  space.  For  example,  this  can  be  troublesome  if  we  want 
to  compute  the  Jacobian  of  the  intersection  point  with  respect  to  the  parameters  of  the  line.  A  better 
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Figure  20:  Image  interpolation  around  the  intersection  point 

alternative  [22]  is  to  represent  the  line  by  its  slopes  in  x  and  y  and  by  its  intersection  with  the  plane  z  -  0 
(See  [35]  for  a  complete  survey  of  3-D  line  representations).  The  equation  of  the  line  then  becomes: 

x-az  +  p  (11) 

y  =  bz  +  q 

We  can  still  use  Equation  (10)  to  compute  the  locus  because  we  can  switch  between  the  ( a.b.p.q )  and 
( u .  v)  representations  by  using  the  Equations: 


v=  b  ,  u  =  q  (12) 

_  1  J  [  0 

“x  v* 

a=  —,p= - vr 

uz 

Uy  Vy 

b  =  —,q  -  ~—vz 
uz  uz 

In  the  subsequent  Sections,  we  will  denote  by  h(a,b.p,q)  the  function  from  /?4  to  that  maps  a  line  in 
space  to  the  intersection  point  with  the  range  image. 
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Figure  21:  The  locus  algorithm  on  range  images 


Figure  21:  The  locus  algorithm  on  range  images  (Continued) 


23 


Figure  21:  The  locus  algorithm  on  range  images  (Continued) 


24 


3.5.3  Evaluating  the  locus  algorithm 

We  evaluate  the  locus  algorithm  by  comparing  its  performance  with  the  other  "naive"  interpolation  algo¬ 
rithms  on  a  set  of  synthesized  range  images  of  simple  scenes.  The  simplest  scenes  are  planes  at  various 
orientations.  Furthermore,  we  add  some  range  noise  using  the  model  of  Section  2.3  in  order  to  evaluate 
the  robustness  of  the  approach  in  the  presence  of  noise.  The  performances  of  the  algorithms  are  evaluated 
by  using  the  mean  square  error 

(,3) 

where  hi  is  the  true  elevation  value  and  hi  is  the  estimated  elevation.  Figure  22  plots  £  for  the  locus 
algorithm  and  the  naive  interpolation  as  a  function  of  the  slope  of  the  observed  plane  and  the  noise  level. 
This  result  shows  that  the  locus  algorithm  is  more  stable  with  respect  to  surface  orientation  and  noise 
level  than  the  other  algorithm.  This  is  due  to  the  fact  that  we  perform  the  interpolation  in  image  space 
instead  of  first  converting  the  data  points  into  the  elevation  map. 


error 


x,  xx  :  Locus  method 
o,  oo  :  Elevation  GNC 
method 

x,  o  :  S/N  ratio  1000 
xx,  oo  :  S/N  ratio  100 


10  20  30  40  tit  angle 

Figure  22:  Evaluation  of  the  locus  algorithm  on  synthesized  images 


3.5.4  Representing  the  uncertainty 

We  have  presented  in  Section  2.3  a  model  of  the  sensor  noise  that  is  a  Gaussian  distribution  along  the 
direction  of  measurement.  We  need  to  transform  this  model  into  a  model  of  the  noise,  or  uncertainty, 
on  the  elevation  values  returned  by  the  locus  algorithm.  The  difficulty  here  is  that  the  uncertainty  in  a 
given  range  value  spreads  to  many  points  in  the  elevation  map,  no  matter  how  the  map  is  oriented  with 
respect  to  the  image  plane  (Figure  13).  We  cannot  therefore  assume  that  the  standard  deviation  of  an 
elevation  is  the  same  as  the  one  of  the  corresponding  pixel  in  the  range  image.  Instead,  we  propose  to 
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use  the  nature  of  the  locus  algorithm  itself  to  derive  a  meaningful  value  for  the  elevation  uncertainty.  To 
facilitate  the  explanation,  we  consider  only  the  case  of  the  basic  locus  algorithm  of  Section  3.5.1  in  which 
we  compute  an  elevation  z  from  the  intersection  of  the  locus  of  a  vertical  line  with  a  depth  profile  from  a 
range  image.  Figure  23  shows  the  principle  of  the  uncertainty  computation  by  considering  a  locus  curve 
that  corresponds  to  a  line  in  space  and  the  depth  profile  from  the  range  image  in  the  neighborhood  of  the 
intersection  point,  each  point  on  the  depth  profile  has  an  uncertainty  whose  density  can  be  represented  by 
a  Gaussian  distribution  as  computed  in  Section  2.3.  The  problem  is  to  define  a  distribution  of  uncertainty 
along  the  line.  The  value  of  the  uncertainty  reflects  how  likely  is  the  given  point  to  be  on  the  actual 
surface  given  the  measurements. 

Let  us  consider  an  elevation  h  along  the  vertical  line.  This  elevation  corresponds  to  a  measurement 
direction  <p(h)  and  a  measured  range  d'(h).  If  d(h)  is  the  distance  between  the  origin  and  the  elevation  h, 
we  assign  to  h  the  confidence  [39]: 


m  = 


V2*a(d'(h)) 


(M)-S(k»2 


(14) 


where  cr(ct(h)  is  the  variance  of  the  measurement  at  the  range  d!(h).  Equation  14  does  not  tell  anything 
about  the  shape  of  the  uncertainty  distribution  1(h)  along  the  h  axis  except  that  it  is  maximum  at  the 
elevation  ho  at  which  d(h)  =  d'(h),  that  is  the  elevation  returned  by  the  locus  algorithm.  In  order  to 
determine  the  shape  of  1(h),  we  approximate  1(h)  around  h0  by  replacing  the  surface  by  its  tangent  plane 
at  ho.  If  a  is  the  slope  of  the  plane,  and  H  is  the  elevation  of  the  intersection  of  the  plane  with  the  2  axis, 
we  have: 


<r(ctm 

(d(h)  -  d'(h ))2 
2  a(d'(h))2 


rH2(a2  +  h2) 

(a  tan  a  +  h)2 
(h  -  ho  )2(a  tan  a  +  h)2 
K^H^a2  +  h2) 


(15) 


(16) 


where  a  is  the  distance  between  the  line  and  the  origin  in  the  x  -  y  plane  and  K  is  defined  in  Section  2.3 
by  a(d)  as  Kd2.  By  assuming  that  h  is  close  to  ho,  that  is  h  =  h0  +  e  with  e  <  ho,  and  by  using  the  fact 
that  H  =  ho  +  a  tana,  we  have  the  approximations: 


o(ct(h)) 
(d(h)  -  ct(h))2 
la(d'(h))2 


K(a2  +  hi) 

(h-  ho)2 

2K2H2(a2  +  h\) 


(17) 


(18) 


In  the  neighborhood  of  ho,  Equation  18  shows  that  (d(h)  -  d'(h))2/2a(cf(h))2  is  quadratic  in  h  -  h0, 
and  that  o(d!(h))  is  constant.  Therefore,  1(h)  can  be  approximated  by  a  Gaussian  distribution  of  variance: 

a  l  =  K2H2(a2  +  h £)  =  K2H2cf0  (19) 
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Figure  23:  Computing  the  uncertainty  from  the  locus  algorithm 

Equation  19  provides  us  with  a  first  order  model  of  the  uncertainty  of  h  derived  by  the  locus  algorithm. 
In  practice,  the  distance  D(h )  =  (d(h)  -  d'(h))2/2o(d'(h))z  is  computed  for  several  values  of  h  close  to 
ho.  the  variance  <7*  is  computed  by  fitting  the  function  (h  -  h0)zj2a\  to  the  values  of  D{h).  This  is  a 
first  order  model  of  the  uncertainty  in  the  sense  that  it  takes  into  account  the  uncertainty  on  the  sensor 
measurements,  but  it  does  not  include  the  uncertainty  due  to  the  locus  algorithm  itself,  in  particular  the 
errors  introduced  by  the  interpolation. 

3.5.5  Detecting  the  range  shadows 

As  we  pointed  out  in  Section  3.1,  the  terrain  may  exhibit  range  shadows  in  the  elevation  map.  It  is 
important  to  identify  the  shadow  regions  because  the  terrain  may  have  any  shape  within  the  boundaries 
of  the  shadows,  whereas  the  surface  would  be  smoothly  interpolated  if  we  applied  the  locus  algorithm 
directly  in  those  areas.  This  may  result  in  dangerous  situations  for  the  robot  if  a  path  crosses  one  of  the 
range  shadows.  A  simple  idea  would  be  to  detect  empty  regions  in  the  raw  elevation  map,  which  are  the 
projection  of  images  in  the  map  without  any  interpolation.  This  approach  does  not  work  because  the  size 
of  the  shadow  regions  may  be  on  the  order  of  the  average  instance  between  data  points.  This  is  especially 
true  for  shadows  that  are  at  some  distance  from  the  sensor  in  which  case  the  distribution  of  data  points 
is  very  sparse.  It  is  possible  to  modify  the  standard  locus  algorithm  so  that  it  takes  into  account  the 
shadow  areas.  The  basic  idea  is  that  a  range  shadow  corresponds  to  a  strong  occluding  edge  in  the  image 
(Figure  12).  An  Ct.y)  location  in  the  map  is  in  a  shadow  area  if  its  locus  intersects  the  image  at  a  pixel 


that  lies  on  such  an  edge  (Figure  24). 


Figure  24:  Detecting  range  shadows 


We  implement  this  algorithm  by  first  detecting  the  edges  in  the  range  image  by  using  a  standard 
technique,  the  GNC  algorithm  [6].  We  chose  this  algorithm  because  it  allows  us  to  vary  the  sensitivity 
of  the  edge  detector  across  the  image,  and  because  it  performs  some  smoothing  of  the  image  as  a  side 
effect.  When  we  apply  the  locus  algorithm  we  can  then  record  the  fact  that  the  locus  of  a  given  location 
intersects  the  image  at  an  edge  pixel.  Such  map  locations  are  grouped  into  regions  that  are  the  reported 
range  shadows.  Figure  25  shows  an  overhead  view  of  an  elevation  map  computed  by  the  locus  algorithm, 
the  white  points  are  the  shadow  points,  the  gray  level  of  the  other  points  is  proportional  to  their  uncertainty 
as  computed  in  the  previous  Section. 

3.5.6  An  application:  footfall  selection  for  a  legged  vehicle 

The  purpose  of  using  the  locus  algorithm  for  building  terrain  is  to  provide  high  resolution  elevation  data. 
As  an  example  of  an  application  in  which  such  a  resolution  is  needed,  we  briefly  describe  in  this  Section 
the  problem  of  perception  for  a  legged  vehicle  [24],  One  of  the  main  responsibilities  of  perception  for 
a  legged  vehicle  is  to  provide  a  terrain  description  that  enables  the  system  to  determine  whether  a  given 
foot  placement,  or  footfall,  is  safe.  In  addition,  we  consider  the  case  of  locomotion  on  very  rugged  terrain 
such  as  the  surface  of  Mars. 

A  foot  is  modeled  by  a  flat  disk  of  diameter  30  cms.  The  basic  criterion  for  footfall  selection  is  to 
select  a  footfall  area  with  the  maximum  support  area  which  is  defined  as  the  contact  area  between  the  foot 
and  the  terrain  as  shown  in  Figure  2A  Another  constraint  for  footfall  selection  is  that  the  amount  of  energy 
necessary  to  penetrate  the  ground  in  order  to  achieve  sufficient  support  area  must  be  minimized.  The 
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Figure  25:  Shadow  regions  in  an  elevation  map 
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Figure  26:  Footfall  support  area 
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energy  is  proportional  to  the  depth  of  the  foot  in  the  ground.  The  support  area  is  estimated  by  counting 
the  number  of  map  points  within  the  circumference  of  the  disk  that  are  above  the  plane  of  the  foot.  This 
is  where  the  resolution  requirement  originates  because  the  computation  of  the  support  area  makes  sense 
only  if  the  resolution  of  the  map  is  significantly  smaller  than  the  diameter  of  the  foot.  Given  a  minimum 
allowed  support  area,  Smm,  and  the  high  resolution  terrain  map,  we  can  find  the  optimal  footfall  position 
within  a  given  terrain  area:  First,  we  want  to  find  possible  flat  areas  by  computing  surface  normals  for 
each  footfall  area  in  a  specified  footfall  selection  area.  Footfalls  with  a  high  surface  normal  are  eliminated. 
The  surface  normal  analysis,  however,  will  not  be  sufficient  for  optimal  footfall  selection.  Second,  the 
support  area  is  computed  for  the  remaining  positions.  The  optimal  footfall  position  is  the  one  for  which 
the  maximum  elevation,  hopt  that  realizes  the  minimum  support  area  Smin  is  the  maximum  across  the  set 
of  possible  footfall  positions.  Figure  27  shows  a  plot  of  the  surface  area  with  respect  to  the  elevation 
from  which  hm„  can  be  computed. 


Figure  27:  Support  area  versus  elevation 


3.5.7  Extracting  local  features  from  an  elevation  map 

The  high  resolution  map  enables  us  to  extract  very  local  features,  such  as  points  of  high  surface  curvature, 
as  opposed  to  the  larger  terrain  patches  of  Section  3.4.  The  local  features  that  we  extract  are  based  on 
the  magnitude  of  the  two  principal  curvatures  of  the  terrain  surface.  The  curvatures  are  computed  as 
in  [34]  by  first  smoothing  the  map,  and  then  computing  the  derivatives  of  the  surface  for  solving  the  first 
fundamental  form.  Figure  28  shows  the  curvature  images  computed  from  an  elevation  map  using  the 
locus  algorithm.  The  resolution  of  the  map  is  ten  centimeters.  Points  of  high  curvature  correspond  to 
edges  of  the  terrain,  such  as  the  edges  of  a  valley,  or  to  sharp  terrain  features  such  as  hills,  or  holes.  In 
any  case,  the  high  curvature  points  are  viewpoint-independent  features  that  can  be  used  for  matching.  We 
extract  the  high  curvature  points  from  both  images  of  principal  curvature.  We  group  the  extracted  points 
into  regions,  then  classify  each  region  as  point  feature,  line,  or  region,  according  to  its  size,  elongation, 
and  curvature  distribution.  Figure  28  shows  the  high  curvature  points  extracted  from  an  elevation  map. 
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The  two  images  correspond  to  the  two  principal  curvatures.  Figure  29  shows  the  three  types  of  local 
features  detected  on  the  map  of  Figure  28  superimposed  in  black  over  the  original  elevation  map.  The 
Figure  shows  that  while  some  features  correspond  merely  to  local  extrema  of  the  surface,  some  such  as 
the  edges  of  the  deep  gully  are  characteristic  features  of  the  scene.  This  type  of  feature  extraction  plays 
an  important  role  in  Section  4  for  combining  multiple  maps  computed  by  the  locus  algorithm. 
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Figure  28:  The  high  curvature  points  of  an  elevation  map 


4  Combining  multiple  terrain  maps 

We  have  so  far  addressed  the  problem  of  building  a  representation  of  the  environment  from  sensor  data 
collected  at  one  fixed  location.  In  the  case  of  mobile  robots,  however,  we  have  to  deal  with  a  stream  of 
images  taken  along  the  vehicle’s  path.  We  could  ignore  this  fact  and  process  data  from  each  viewpoint  as 
if  it  were  an  entirely  new  view  of  the  world,  thus  forgetting  whatever  information  we  may  have  extracted 
at  past  locations.  It  has  been  observed  that  this  approach  is  not  appropriate  for  mobile  robot  navigation, 
and  that  there  is  a  need  for  combining  the  representations  computed  from  different  vantage  points  into  a 
coherent  map.  Although  this  has  been  observed  first  in  the  context  of  indoor  mobile  robots  [13,15],  the 
reasoning  behind  it  holds  true  in  our  case.  First  of  all,  merging  representations  from  successive  viewpoints 
will  produce  a  map  with  more  information  and  better  resolution  than  any  of  the  individual  maps.  For 
example,  a  tall  object  observed  by  a  range  sensor  creates  an  unknown  area  behind  it,  the  range  shadow, 
where  no  useful  information  ji  be  extracted  (Section  3.1).  The  shape  and  position  of  the  range  shadow 
changes  as  we  move  to  anou.er  location;  merging  images  from  several  locations  will  therefore  reduce 
the  size  of  the  shadow,  thus  providing  a  more  complete  description  to  the  path  planner  (Figure  30). 
Another  reason  why  merging  maps  increases  the  resolution  of  the  resulting  representation  concerns  the 
fact  that  the  resolution  of  an  elevation  map  is  significantly  better  at  close  range.  By  merging  maps,  we 
can  increase  the  resolution  of  the  parts  of  the  elevation  map  that  were  originally  measured  at  a  distance 
from  the  vehicle. 

The  second  motivation  for  merging  maps  is  that  the  position  of  the  vehicle  at  any  given  time  is 
uncertain.  Even  when  using  expensive  positioning  systems,  we  have  to  assume  that  the  robot’s  idea  of 
its  position  in  the  world  will  degrade  in  the  course  of  a  long  mission.  One  way  to  solve  this  problem 
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is  to  compute  the  position  with  respect  to  features  observed  in  the  world  instead  of  a  fixed  coordinate 
system  [37,30],  That  requires  the  identification  and  fusion  of  common  features  between  successive 
observations  in  order  to  estimate  the  displacement  of  the  vehicle  (Figure  31).  Finally,  combining  maps  is 
a  mission  requirement  in  the  case  of  an  exploration  mission  in  which  the  robot  is  sent  into  an  unknown 
territory  to  compile  a  map  of  the  observed  terrain. 

Reduced  range  shadow 
from  the  combination  of  1  and  2 


position  2 

Figure  30:  Reducing  the  range  shadow 

Many  new  problems  arise  when  combining  maps:  representation  of  uncertainty,  data  structures  for 
combined  maps,  predictions  from  one  observation  to  the  next  etc.  We  shall  focus  on  the  terrain  matching 
problem,  that  is  the  problem  of  finding  common  features  or  common  parts  between  terrain  maps  so  that 
we  can  compute  the  displacement  of  the  vehicle  between  the  two  corresponding  locations  and  then  merge 
the  corresponding  portions  of  the  terrain  maps.  We  always  make  the  reasonable  assumption  that  a  rough 
estimate  of  the  displacement  is  available  sinc~  an  estimate  can  always  be  computed  either  from  dead 
reckoning  or  from  past  terrain  matchings. 

4.1  The  terrain  matching  problem:  iconic  vs.  feature-based 

In  the  terrain  matching  problem,  as  in  any  problem  in  which  correspondences  between  two  sets  of  data 
must  be  found,  we  can  choose  one  of  two  approaches:  feature-based  or  iconic  matching.  In  feature-based 
matching,  we  first  have  to  extract  two  sets  of  features  (F- )  and  (Fj)  from  the  two  views  to  be  matched,  and 
to  find  correspondences  between  features,  (Fjt.Fj)  that  are  globally  consistent.  We  can  then  compute  the 
displacement  between  the  two  views  from  the  parameters  of  the  features,  and  finally  merge  them  into  one 
common  map.  Although  this  is  the  standard  approach  to  object  recognition  problems  [5],  it  has  also  been 
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Feature  observed 
from  position  1 


Feature  observed 
from  position  2 


Position  1  Regions  of  Position  2 

position  uncertainty 


Figure  31:  Matching  maps  for  position  estimation 

widely  used  for  map  matching  for  mobile  robots  [13,23,30,7,1,41].  In  contrast,  iconic  approaches  work 
directly  on  the  two  sets  of  data  points,  Pl  and  P2  by  minimizing  a  cost  function  of  the  form  F(T(P2).  Px  ) 
where  T(P2)  is  the  set  of  points  from  view  2  transformed  by  a  displacement  T.  The  cost  is  designed  so 
that  its  minimum  corresponds  to  a  "best"  estimate  of  T  in  some  sense.  The  minimization  of  F  leads  to  an 
iterative  gradient-like  algorithm.  Although  less  popular,  iconic  techniques  have  been  successfully  applied 
to  incremental  depth  estimation  [30,29]  and  map  matching  [40,12]. 

The  proponents  of  each  approach  have  valid  arguments.  The  feature -based  approach  requires  a  search 
in  the  space  of  possible  matches  which  may  lead  to  a  combinatorial  explosion  of  the  matching  program. 
On  the  other  hand,  iconic  approaches  are  entirely  predictable  in  terms  of  computational  requirements 
but  are  usually  quite  expensive  since  the  size  of  the  points  sets  P‘  is  typically  on  the  order  of  several 
thousands.  As  for  the  accuracy  of  the  resulting  displacement  T,  the  accuracy  of  iconic  techniques  can  be 
better  than  the  resolution  of  the  sensors  if  we  iterate  the  minimization  of  F  long  enough,  while  any  feature 
extraction  algorithm  loses  some  of  the  original  sensor  accuracy.  Furthermore,  feature  matching  could  in 
theory  be  used  even  if  no  a-priori  knowledge  of  T,  To,  is  available  while  iconic  approaches  require  To  to 
be  close  to  the  actual  displacement  because  of  the  iterative  nature  of  the  minimization  of  F. 

Keeping  these  tenets  in  mind,  we  propose  to  combine  both  approaches  into  one  terrain  matching 
algorithm.  The  basic  idea  is  to  use  the  feature  matching  to  compute  a  first  estimate  t  given  a  rough  initial 
value  To,  and  then  to  use  an  iconic  technique  to  compute  an  accurate  estimate  f.  This  has  the  advantage 
of  retaining  the  level  of  accuracy  of  iconic  techniques  while  keeping  the  computation  time  of  the  iconic 
stage  under  control  because  the  feature  matching  provides  an  estimate  close  enough  to  the  true  value.  Wc 
describe  in  detail  the  feature-based  and  iconic  stages  in  the  next  three  sections. 
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4.2  Feature-based  matching 

Let  Fj  and  Fj  be  two  sets  of  features  extracted  from  two  images  of  an  outdoor  scene,  I\  and  h.  We 
want  to  find  a  transformation  f  and  a  set  of  pairs  C*  =  ( F}k,Fjt)  such  that  Fjt  ss  t(F}k),  where  T(F) 
denotes  the  transformed  by  T  of  a  feature  F.  The  features  can  be  any  of  those  discussed  in  the  previous 
Sections:  points  or  lines  from  the  local  feature  extractor,  obstacles  represented  by  a  ground  polygon,  or 
terrain  patches  represented  by  their  surface  equation  and  their  polygonal  boundaries.  We  first  investigate 
the  feature  matching  algorithm  independently  of  any  particular  feature  type  so  that  we  can  then  apply  it 
to  any  level  of  terrain  representation. 

For  each  feature  Fj,  we  can  first  compute  the  set  of  features  Ffj  that  could  correspond  to  Fj  given 
an  initial  estimate  Jo  of  the  displacement.  The  Fjj's  should  lie  in  a  prediction  region  centered  at  To(Fj). 
The  size  of  the  prediction  region  depends  on  the  confidence  we  have  in  To  and  in  the  feature  extractors. 
For  example,  the  centers  of  the  polygonal  obstacles  of  Section  3.4  are  not  known  accurately,  while  the 
curvature  points  from  Section  3.5.7.  can  be  accurately  located.  The  confidence  on  the  displacement  T  is 
represented  by  the  maximum  distance  6  between  a  point  in  image  1  and  the  transformed  of  its  homologue 
in  image  2,  \\Tp 2  —  z?1 1| ,  and  by  the  maximum  angle  e,  between  a  vector  in  image  2  and  the  transformed 
of  its  homologue  in  image  1  by  the  rotation  part  of  T.  The  prediction  is  then  defined  as  the  set  of  features 
that  are  at  a  Cartesian  distance  lower  than  6,  and  at  an  angular  distance  lower  than  e  from  To(Ff).  The 
parameters  used  to  determine  if  a  feature  belongs  to  a  prediction  region  depend  on  the  type  of  that  feature. 
For  example,  we  use  the  direction  of  a  line  for  the  test  on  the  angular  distance,  while  the  center  of  an 
obstacle  is  used  for  the  test  on  the  Cartesian  distance.  Some  features  may  be  tested  only  for  orientation, 
such  as  lines,  or  only  for  position,  such  as  point  features.  The  features  in  each  prediction  region  are 
sorted  according  to  some  feature  distance  d(Fj ,  To(Ffj))  that  reflects  how  well  the  features  are  matched. 
The  feature  distance  depends  also  on  the  type  of  the  feature:  for  points  we  use  the  usual  distance,  for 
lines  we  use  the  angles  between  the  directions,  and  for  polygonal  patches  (obstacles  or  terrain  patches) 
we  use  a  linear  combination  of  the  distance  between  the  centers,  the  difference  between  the  areas,  the 
angle  between  the  surface  orientations,  and  the  number  of  neighboring  patches.  The  features  in  image  1 
are  also  sorted  according  to  an  "importance"  measure  that  reflects  how  important  the  features  are  for  the 
matching.  Such  importance  measures  include  the  length  of  the  lines,  the  strength  of  the  point  features 
(i.e.  the  curvature  value)  ,  and  the  size  of  the  patches.  The  importance  measure  also  includes  the  type  of 
the  features  because  some  features  such  as  obstacles  are  more  reliably  detected  than  others,  such  as  point 
features. 

Once  we  have  built  the  prediction  regions,  we  can  search  for  matches  between  the  two  images.  The 
search  proceeds  by  matching  the  features  Fj  to  the  features  F2  that  are  in  their  prediction  region  starting 
at  the  most  important  feature.  We  have  to  control  the  search  in  order  to  avoid  a  combinatorial  explosion 
by  taking  advantage  of  the  fact  that  each  time  a  new  match  is  added  both  the  displacement  and  the  future 
matches  are  further  constrained.  The  displacement  is  constrained  by  combining  the  current  estimate  T 
with  the  displacement  computed  from  a  new  match  (Fj .  Ffj).  Even  though  the  displacement  is  described 
by  six  components,  the  number  of  components  of  the  displacement  that  can  be  computed  from  one  single 
match  depends  on  the  type  of  features  involved:  point  matches  provide  only  three  components,  line 
matches  provide  four  components  (two  rotations  and  two  translations),  and  region  matches  provide  three 
components.  We  therefore  combine  the  components  of  T  with  those  components  of  the  new  match  that 
can  be  computed.  A  given  match  prunes  the  search  by  constraining  the  future  potential  matches  in  two 


ways:  if  connectivity  relations  between  features  are  available,  as  in  the  case  of  terrain  patches,  then  a 
match  (£•  ,E2)  constrains  the  possible  matches  for  the  neighbors  of  Fj)  in  that  they  have  to  be  adjacent  to 
Fjj.  In  the  case  of  points  or  patches,  an  additional  constraint  is  induced  by  the  relative  placement  of  the 
features  in  the  scene:  two  matches,  (E-,E|)  and  (Fj, , Fjj, ),  are  compatible  only  if  the  angle  between  the 

vectors  w1  =  F},F-  and  w2  =  e|^e|  is  lower  than  x,  provided  the  rotation  part  of  T  is  no  greater  than  x 
which  is  the  case  in  realistic  situations.  This  constraint  means  that  the  relative  placement  of  the  features 
remains  the  same  from  image  to  image  which  is  similar  to  the  classical  ordering  constraint  used  in  stereo 
matching. 

The  result  of  the  search  is  a  set  of  possible  matchings,  each  of  which  is  a  set  of  pairs  5  =  ( F}k,Fjk)k 
between  the  two  sets  of  features.  Since  we  evaluated  T  simply  by  combining  components  in  the  course 
of  the  search,  we  have  to  evaluate  T  for  each  S  in  order  to  get  an  accurate  estimate.  T  is  estimated  by 
minimizing  an  error  function  of  the  form: 

E  =  £  d(F}t  -  7(E2  ))  (20) 

k 

The  distance  d(.)  used  in  Equation  (20)  depends  on  the  type  of  the  features  involved:  For  point  features, 
it  is  the  usual  distance  between  two  points;  for  lines  it  is  the  weighted  sum  of  the  angle  between  the  two 
lines  and  the  distance  between  the  distance  vectors  of  the  two  lines;  for  regions  it  is  the  weighted  sum  of 
the  distance  between  the  unit  direction  vectors  and  the  distance  between  the  two  direction  vectors.  All  the 
components  of  T  can  be  estimated  in  general  by  minimizing  £.  We  have  to  carefully  identify,  however, 
the  cases  in  which  insufficient  features  are  present  in  the  scene  to  fully  constrain  the  transformation.  The 
matching  S  that  realizes  the  minimum  £  is  reported  as  the  final  match  between  the  two  maps  while  the 
corresponding  displacement  f  is  reported  as  the  best  estimate  of  the  displacement  between  the  two  maps. 
The  error  E(f)  can  then  be  used  to  represent  the  uncertainty  in  T. 

This  approach  to  feature  based  matching  is  quite  general  so  that  we  can  apply  it  to  many  different 
types  of  features,  provided  that  we  can  define  the  distance  d(.)  in  Equation  (20),  the  importance  measure, 
and  the  feature  measure.  The  approach  is  also  fairly  efficient  as  long  as  6  and  e  do  not  become  too  large, 
in  which  case  the  search  space  becomes  itself  large.  We  describe  two  implementations  of  the  feature 
matching  algorithm  in  the  next  two  Sections. 

4.2.1  Example:  Matching  polygonal  representations 

We  have  implemented  the  feature-based  matching  algorithm  on  the  polygonal  descriptions  of  Section  3.4 
and  3.3.  The  features  are  in  this  case: 

•  The  polygons  describing  the  terrain  parametrized  by  their  areas,  the  equation  of  the  underlying 
surface,  and  the  center  of  the  region 

•  The  polygons  describing  the  trace  of  the  major  obstacles  detected  (if  any). 

•  The  road  edges  found  in  the  reflectance  images  if  the  road  detection  is  reliable  enough.  The 
reliability  is  measured  by  how  much  a  pair  of  road  edges  deviates  from  the  pair  found  in  the 
previous  image. 
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Figure  32:  A  sequence  of  range  and  reflectance  images 


The  obstacle  polygons  have  a  higher  weight  in  the  search  itself  because  their  detection  is  more  reliable 
than  the  terrain  segmentation,  while  the  terrain  regions  and  the  road  edges  contribute  more  to  the  final 
estimate  of  the  displacement  since  their  localization  is  better.  Once  a  set  of  matches  and  a  displacement  T 
are  computed,  the  obstacles  and  terrain  patches  that  are  common  between  the  current  map  and  a  new  image 
are  combined  into  new  polygons,  the  new  features  are  added  to  the  map  while  updating  the  connectivity 
between  features. 

This  application  of  the  feature  matching  has  been  integrated  with  the  rest  of  the  Navlab  system.  In 
the  actual  system,  the  estimates  of  the  displacement  To  are  taken  from  the  central  database  that  keeps 
track  of  the  vehicle’s  position.  The  size  of  prediction  region  is  fixed  with  6  =  one  meter,  and  e  =  20°. 
This  implementation  of  the  feature  matching  has  performed  successfully  over  the  course  of  runs  of  several 
hundred  meters.  The  final  product  of  the  matching  is  a  map  that  combines  all  the  observations  made 
during  the  run,  and  a  list  of  updated  obstacle  descriptions  that  are  sent  to  a  map  module  at  regular  intervals. 
Since  errors  in  determining  position  tend  to  accumulate  during  such  long  mns,  we  always  keep  the  map 
centered  around  the  current  vehicle  position.  As  a  result,  the  map  representation  is  always  accurate 
close  to  the  current  vehicle  position.  As  an  example,  Figure  34  shows  the  result  of  the  matching  on  five 
consecutive  images  separated  by  about  one  meter.  The  scene  in  this  case  is  a  road  bordered  by  a  few  trees. 
Figure  32  shows  the  original  sequence  of  raw  range  and  reflectance  images,  Figure  33  shows  perspective 
views  of  the  corresponding  individual  maps,  and  Figure  34  is  a  rendition  of  the  combined  maps  using  the 
displacement  and  matches  computed  from  the  feature  matching  algorithm.  This  last  display  is  a  view  of 
the  map  rotated  by  45°  about  the  x  axis  and  shaded  by  the  values  from  the  reflectance  image. 

4.2.2  Example:  Matching  local  features  from  high  resolution  maps 

Matching  local  features  from  high  resolution  maps  provides  the  displacement  estimate  for  the  iconic 
matching  of  high  resolution  maps.  The  primitives  used  for  the  matching  are  the  high  curvature  points  and 
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Figure  33:  Individual  maps 


Figure  34:  Perspective  view  of  the  combined  map 


lines  described  in  Section  3.5.7.  The  initial  matches  are  based  on  the  similarity  of  the  length  of  the  lines 
and  the  similarity  of  the  curvature  strength  of  the  points.  The  search  among  candidate  matches  proceeds 
as  described  in  Section  4.2.  Since  we  have  dense  elevation  at  our  disposal  in  this  case,  we  can  evaluate 
a  candidate  displacement  over  the  entire  map  by  summing  up  the  squared  differences  between  points  in 
one  map  and  points  in  the  transformed  map.  Figure  35  shows  the  result  of  the  feature  matching  on  a  pair 
of  maps.  The  top  image  shows  the  superimposition  of  the  contours  and  features  of  the  two  maps  using 
the  estimated  displacement  (about  one  meter  translation  and  4°  rotation),  while  the  bottom  image  shows 
the  correspondences  between  the  point  and  line  features  in  the  two  maps.  The  lower  map  is  transformed 
by  T  with  respect  to  the  lower  right  map.  Figure  36  shows  the  result  of  the  feature  matching  in  a  case  in 
which  the  maps  are  separated  by  a  very  large  displacement.  The  lower  left  display  shows  the  area  that 
is  common  between  the  two  maps  after  the  displacement.  Even  though  the  resulting  displacement  is  not 
accurate  enough  to  reliably  merge  the  maps,  it  is  close  enough  to  the  optimum  to  be  used  as  the  starting 
point  of  a  minimization  algorithm. 


4.3  Iconic  matching  from  elevation  maps 

The  general  idea  of  the  iconic  matching  algorithm  is  to  find  the  displacement  T  between  two  elevation 
maps  from  two  different  range  images  that  minimizes  an  error  function  computed  over  the  entire  combined 
elevation  map.  The  error  function  E  measures  how  well  the  first  map  and  the  transformed  of  the  second 
map  by  T  do  agree.  The  easiest  formulation  for  E  is  the  sum  of  the  squared  differences  between  the 
elevation  at  a  location  in  the  first  map  and  the  elevation  at  the  same  location  computed  from  the  second 
map  using  T.  To  be  consistent  with  the  earlier  formulation  of  the  locus  algorithm,  the  elevation  at  any 
point  of  the  first  map  is  actually  the  intersection  of  a  line  containing  this  point  with  the  range  image.  We 
need  some  additional  notations  to  formally  define  E:  R  and  t  denote  the  rotation  and  translation  parts  of 
T  respectively,  f(u,  v)  is  the  function  that  maps  a  line  in  space  described  by  a  point  and  a  unit  vector  to 
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a  point  in  by  the  generalized  locus  algorithm  of  Section  3.5.2  applied  to  image  i.  We  have  then: 


£  =  ^|l/-,(u,v)-g(u,v,r)||2  (21) 

where  g(u,  v,  T)  is  the  intersection  of  the  transformed  of  the  line  (u,  v)  by  T  with  image  2  expressed  in  the 
coordinate  system  of  image  1  (Figure  37).  The  summation  in  Equation  (21)  is  taken  over  all  the  locations 
(u,  v)  in  the  first  map  where  both  fi(u,v)  and  g(u,v,  T)  are  defined.  The  lines  («,  v)  in  the  first  map  are 
parallel  to  the  z-axis.  In  other  words: 

g(u,  v,T)  =  T'  (f2(u!,  V))  =  R'f2(ul,  v')  +  {  (22) 

where  T~l  =  (R',f)  =  (R~l  ,-R~lt)  is  the  inverse  transformation  of  T,  and  (o',  V)  =  (Ru  +  t,Rv )  is  the 
transformed  of  the  line  ( u ,  v).  This  Equation  demonstrates  one  of  the  reasons  why  the  locus  algorithm 
is  powerful:  in  order  to  compute  f2(Ru  +  t,Rv)  we  can  apply  directly  the  locus  algorithm,  whereas  we 
would  have  to  do  some  interpolation  or  resampling  if  we  were  using  conventional  grid-based  techniques. 
We  can  also  at  this  point  fully  justify  the  formulation  of  the  generalized  locus  algorithm  in  Section  3.5.2: 
The  transformed  line  (u\  v')  can  be  anywhere  in  space  in  the  coordinate  system  of  image  2,  even  though 
the  original  line  ( u ,  v)  is  parallel  to  the  z-axis,  necessitating  the  generalized  locus  algorithm  to  compute 

/2(«'y). 


We  now  have  to  find  the  displacement  T  for  which  E  is  minimum.  If  v  =  [a,/?, 7 ,tx,ty,tz]1  is  the 
6-vector  of  parameters  of  T,  where  the  first  three  components  are  the  rotation  angles  and  the  last  three 
are  the  components  of  the  translation  vector,  then  E  reaches  a  minimum  when: 


Assuming  an  initial  estimate  l'o,  such  a  minimum  can  be  found  by  an  iterative  gradient  descent  ol 
the  foim: 


/+1  =  ul  +  k^-(ul) 
du 


(24) 


where  i /  is  the  estimate  of  u  at  iteration  i.  From  Equation  (21),  the  derivative  of  E  can  be  computed  by: 


8E 

du 


=  -2  £(fi («,  v)  -  g(u,  v,  T))^(u,  v,  T) 


(25) 


From  Equation  (22),  we  get  the  derivative  of  g: 


(26) 


The  derivatives  appeasing  in  the  last  two  components  in  Equation  (26)  are  the  derivatives  of  the 
transformation  with  respect  to  its  parameters  which  can  be  computed  analytically.  The  last  step  to 
compute  the  derivative  of  g(u,v,T)  is  therefore  to  compute  the  derivative  of  fifu',  v1)  with  icspect  to  u. 
We  could  write  the  derivative  with  respect  to  each  component  u,  of  u  by  applying  the  chain  rule  directly: 


=  —  ~  d/2  dv' 

du,  U  ’  du  du,  +  dv  du. 


(27) 


Equation  (27)  leads  however  to  unstabilities  in  the  gradient  algorithm  because,  as  we  pointed  out  in 
Section  3.5.2,  the  (u,  v)  representation  is  an  ambiguous  representation  of  lines  in  space.  We  need  to  use  a 
non  ambiguous  representation  in  order  to  correctly  compute  the  derivative.  According  to  equation  (13),  we 
can  use  interchangeably  the  («,  v)  representation  and  the  unambiguous  (a,  b,p,  q)  representation.  Therefore 
by  considering /2  as  a  function  of  the  transform  by  T,  /'  =  (o',  £',//,  </),  of  a  line  /  =  ( a,b,p,q )  in  image 
1,  we  can  transform  Equation  (27)  to: 


.  dh_  dT_ 

dvi  dt  du, 


(28) 


Since  the  derivative  dfajdt  depends  only  on  the  data  in  image  2,  we  cannot  compute  it  analytically 
and  have  to  estimate  it  from  the  image  data.  We  approximate  the  derivatives  of fz  with  respect  to  a,b,p, 
and  q  by  differences  of  the  type: 

dh_  _  f(a  + Aa,b,p,q)  -  f(a,b,p,q ) 
da  Aa 

Approximations  such  as  Equation  (29)  work  well  because  the  combination  of  the  locus  algorithm  and  the 
GNC  image  smoothing  produces  smooth  variations  of  the  intersection  points. 

The  last  derivatives  that  we  have  to  compute  to  complete  the  evaluation  of  dE/du  are  the  derivatives 
of  (  with  respect  to  each  motion  parameter  i We  start  by  observing  that  if  X  =  [x,y,  z]'  is  a  point  on 
the  line  of  parameter  /,  and  X1  =  [x/,y,z/]'  is  the  transformed  of  X  by  T  that  lies  on  a  line  of  parameter 
/',  then  we  have  the  following  relations  from  Equation  (13): 


x=  az  +  p,^  =  aV  +  p'  (30) 

y  =  bz  +  q,y  -b'zl^q1 
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By  eliminating  X  "d  X'  between  Equation  (30)  and  the  relation  X'  =  RX+ 1,  we  have  the  relation  between 
/  and  /': 

=  p'  =  Rx.U  +  tx-d{Rz.U  +  tz)  (31) 

KZ.V 

y  =  irr,'  q'  =  RyU+ty-mt.U+tz) 

KZ.V 

where  Rz,Ry,Rz  are  the  row  vectors  of  the  rotation  matrix  R,  A  -  [a,b,  1]',  B  =  [/?,<?,  0]'.  We  now  have  /' 
as  a  function  of  l  and  T,  making  it  easy  to  compute  the  derivatives  with  respect  to  p,  from  Equation  (31). 

In  the  actual  implementation  of  the  matching  algorithm,  the  points  at  which  the  elevation  is  computed 
in  the  first  map  are  distributed  on  a  square  grid  of  ten  centimeters  resolution.  The  lines  (u,  v)  are  therefore 
vertical  and  pass  through  the  centers  of  the  grid  cells.  £  is  normalized  by  the  number  of  points  since 
the  since  of  the  overlap  region  between  the  two  maps  is  not  known  in  advance.  We  first  compute  the 
f\(u,  v)  for  the  entire  grid  for  image  1,  and  then  apply  directly  the  gradient  descent  algorithm  described 
above.  The  iterations  stop  either  when  the  variation  of  error  AE  is  small  enough,  or  when  E  itself  is 
small  enough.  Since  the  matching  is  computationally  expensive,  we  compute  E  over  an  eight  by  eight 
meter  window  in  the  first  image.  The  last  test  ensures  that  we  do  not  keep  iterating  if  the  error  is  smaller 
than  what  can  be  reasonably  achieved  given  the  characteristics  of  the  sensor.  Figure  38  shows  the  result 
of  combining  three  high  resolution  elevation  maps.  The  displacements  between  maps  are  computed  using 
the  iconic  matching  algorithm.  The  maps  are  actually  combined  by  replacing  the  elevation  fi(u,  v)  by  the 
combination: 


(J\f\  +  <72/2 

C 71  +  <72 


(32) 


where  <7 1  and  U2  are  the  uncertainty  values  computed  as  in  Section  3.5.4.  Equation  (32)  is  derived  by 
considering  the  two  elevation  values  as  Gaussian  distributions.  The  resulting  mean  error  in  elevation  is 
lower  than  ten  centimeters.  We  computed  the  initial  To  by  using  the  local  feature  matching  of  Section  4.2.2. 
This  estimate  is  sufficient  to  ensure  the  convergence  to  the  true  value.  This  is  important  because  the 
gradient  descent  algorithm  converges  towards  a  local  minimum,  and  it  is  therefore  important  to  show 
that  To  is  close  to  the  minimum.  Figure  39  plots  the  value  of  the  vC s  with  respect  to  the  number  of 
iterations.  These  curves  show  that  E  converges  in  a  smooth  fashion.  The  coefficient  k  that  controls  the 
rate  of  convergence  is  very  conservative  in  this  case  in  order  to  avoid  oscillations  about  the  minimum. 

Several  variations  of  the  core  iconic  matching  algorithm  are  possible.  First  of  all,  we  assumed 
implicitly  that  £  is  a  smooth  function  of  v\  this  not  true  in  general  because  the  summation  in  Equation  (21) 
is  taken  only  over  the  regions  in  which  both  f\  and  g  are  defined,  that  is  the  intersection  of  the  regions  of 
map  1  and  2  that  is  neither  range  shadows  nor  outside  of  the  field  of  view.  Such  a  summation  implicitly 
involves  the  use  of  a  non-differentiable  function  that  is  1  inside  the  acceptable  region  and  0  outside.  This 
does  not  affect  the  algorithm  significantly  because  the  changes  in  v  from  one  iteration  to  the  next  are 
small  enough.  A  differentiable  formulation  for  £  would  be  of  the  form: 


E  =  (u,v)fi2(T{u,  v))|[fi(«,v)  -  g(u,v,T)||2 


(33) 


where  pi(u,  v)  is  a  function  that  is  at  most  1  when  the  point  is  inside  a  region  where  f(u,  v)  is  defined 
and  vanishes  as  the  point  approaches  a  forbidden  region,  that  is  a  range  shadow  or  a  region  outside  of 
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Figure  39:  Convergence  rate  of  the  matching  algorithm 


the  field  of  view.  The  summation  in  Eq.  33  is  taken  over  tne  entire  map.  In  order  to  avoid  a  situation  in 
which  the  minimum  is  attained  when  the  two  maps  do  not  overlap  (£  =  0),  we  must  also  normalize  £  by 
the  number  of  points  in  the  overlap  region.  For  £  to  be  still  smooth,  we  should  therefore  normalize  by: 

£mi  (u,v)/i2(u,v)  (34) 

In  addition  to  £  being  smooth,  we  also  assumed  that  matching  the  two  maps  entirely  determines  the 
six  parameters  of  T.  This  assumption  may  not  be  true  in  all  cases.  A  trivial  example  is  one  in  which  we 
match  two  images  of  a  flat  plane,  where  only  the  vertical  translation  can  be  computed  from  the  matching. 
The  gradient  algorithm  does  not  converge  in  those  degenerate  cases  because  the  minimum  T(u)  may  have 
arbitrarily  large  values  within  a  surface  in  parameter  space.  A  modification  of  the  matching  algorithm 
that  would  ensure  that  the  algorithm  does  converge  to  some  infinite  value  changes  Equation  (21)  to: 

e~Y1  IL/iCm,  v)  -  g(u,  v,  T) ||2  +  -W  (35) 

i 

The  effect  of  the  weights  A,  is  to  include  the  constraint  that  the  iVs  do  not  increase  to  infinity  in  the 
minimization  algorithm. 


5  Combining  range  and  intensity  data 

In  the  previous  Section  we  have  concentrated  on  the  use  of  3-D  vision  as  it  relates  solely  to  the  navigation 
capabilities  of  mobile  robots.  Geometric  accuracy  was  the  deciding  factor  in  the  choice  of  representations 
and  algorithms  while  we  gave  very  little  attention  to  the  extraction  of  semantic  information.  A  mobile 
robot  needs  more  than  just  navigation  capabilities,  however,  since  it  also  must  be  able  to  extract  semantic 
descriptions  from  its  sensors.  For  example,  we  will  describe  a  landmark:  recognition  algorithm  in  Section 
5.  In  that  case,  the  system  is  able  not  only  to  build  a  geometric  representation  of  an  object  but  also  to 
relate  it  to  a  stored  model. 

Extracting  semantic  information  for  landmark  recognition  or  scene  analysis  may  require  much  more 
than  just  geometric  data  from  a  range  sensor.  For  example,  interpreting  surface  markings  is  the  only  way 
to  unambiguously  recognize  traffic  signs.  Conversely,  the  recognition  of  a  complex  man-made  object  of 
uniform  color  is  easiest  when  using  geometric  information.  In  this  Section  we  address  the  problem  of 
combining  3-D  data  with  data  from  other  sensors.  The  most  interesting  problem  is  the  combination  of 
3-D  data  with  color  images  since  these  are  the  two  most  common  sensors  for  outdoor  robots.  Since  the 
sensors  have  different  fields  of  view  and  positions,  we  first  present  an  algorithm  for  transforming  the 
images  into  a  common  frame.  As  an  example  of  the  use  of  combined  range/color  images,  we  describe  a 
simple  scene  analysis  program  in  Section  5.3. 


5.1  The  geometry  of  video  cameras 

The  video  camera  is  a  standard  color  vidicon  camera  equipped  with  wide-angle  lenses.  The  color  images 
are  480  rows  by  512  columns,  and  each  band  is  coded  on  eight  bits.  The  wide-angle  lens  induces  a 
significant  geometric  distortion  in  that  the  relation  between  a  point  in  space  and  its  projection  on  the 
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image  plane  does  not  obey  the  laws  of  the  standard  perspective  transformation.  We  alleviate  this  problem 
by  first  transforming  the  actual  image  into  an  "ideal"  image:  if  ( R,Q  is  the  position  in  the  real  image, 
then  the  position  (r,  c)  in  the  ideal  image  is  given  by: 


r=fr(R,Q,c=fc(R,C)  (36) 

where  fr  and  fc  are  third  order  polynomials.  This  correction  is  cheap  since  the  right-hand  side  of  (36) 
can  be  put  in  lookup  tables.  The  actual  computation  of  the  polynomial  is  described  in  [31]  The  geometry 
of  the  ideal  image  obeys  the  laws  of  the  perspective  projection  in  that  if  P  =  [x,y,  z]'  is  a  point  in  space, 
and  (r,  c)  is  its  projection  in  the  ideal  image  plane,  then: 

r=fx/z,c=fy/z  (37) 

where  /  is  the  focal  length.  In  the  rest  of  the  paper,  row  and  column  positions  will  always  refer  to  the 
positions  in  the  ideal  image,  so  that  perspective  geometry  is  always  assumed. 


5.2  The  registration  problem 

•  Range  sensor  and  video  cameras  have  different  fields  of  view,  orientations,  and  positions.  In  order  to 

be  able  to  merge  data  from  both  sensors,  we  first  have  to  estimate  their  relative  positions,  known  as  the 
calibration  or  registration  problem  (Figure  41).  We  approach  the  problem  as  a  minimization  problem  in 
which  pairs  of  pixels  are  selected  in  the  range  and  video  images.  The  pairs  are  selected  so  that  each  pair 
is  the  image  of  a  single  point  in  space  as  viewed  from  the  two  sensors.  The  problem  is  then  to  find  the 

0  best  calibration  parameters  given  these  pairs  of  points  and  is  further  divided  into  two  steps:  we  first  use 

a  simple  linear  least-squares  approach  to  find  a  rough  initial  estimate  of  the  parameters,  and  then  apply  a 
non-linear  minimization  algorithm  to  compute  an  optimal  estimate  of  the  parameters. 
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5.2.1  The  calibration  problem  as  a  minimization  problem 

Let  P i  be  a  point  in  space,  with  coordinates  Pf  with  respect  to  the  range  sensor,  and  coordinates  Pci  with 
respect  to  the  video  camera.  The  relationship  between  the  two  coordinates  is: 

Pf  =  RPf-T  (38) 

where  R  is  a  rotation  matrix,  and  T  is  a  translation  vector.  R  is  a  non-linear  function  of  the  orientation 
angles  of  the  camera:  pan  (a),  tilt  (/?),  and  rotation  (7).  Pf  can  be  computed  from  a  pixel  location  in  the 
range  image.  Pf  is  not  completely  known,  it  is  related  to  the  pixel  position  in  the  video  image  by  the 
perspective  transformation: 


A'i  =  fA  (39) 

Aa=fyci  (40) 

where  /  is  the  focal  length.  Substituting  (38)  into  (39)  and  (40)  we  get: 

RzP-n  -  Tzn  -  fRxPf  +  TX  =  0  (41) 

RzPfc,  -  TzCi  -  fRyPf  +  ry  =  0  (42) 


where  Rx,  Ry,  and  R2  are  the  row  vectors  of  the  rotation  matrix  R,  and  T  -  fTy,  Tx  -  fTx. 

We  are  now  ready  to  reduce  the  calibration  problem  to  a  least-squares  minimization  problem.  Given 
n  points  P„  we  want  to  find  the  transformation  ( R,T)  that  minimizes  the  left-hand  sides  of  equations  (41) 
and  (42).  We  first  estimate  T  by  a  linear  least-squares  algorithm,  and  then  compute  the  optimal  estimate 
of  all  the  parameters. 

5.2.2  Initial  estimation  of  camera  position 

Assuming  that  we  have  an  estimate  of  the  orientation  R,  we  want  to  estimate  the  corresponding  T.  The 
initial  value  of  R  can  be  obtained  by  physical  measurements  using  inclinometers.  Under  these  conditions. 
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the  criterion  to  be  minimized  is: 


ft 

c  =  04,  -  rzfii  -/C,  +  rx)2  +  (D,  -  r2£;  -/£,  +  r/]  (43) 

1=1 

where  4/  =  R^n,  Bi  =  r(,  C,  =  /?*/*,  D,  =  R2P*Ci,  Ei  =  c„  and  F,  =  RyP*  are  known  and  Tz,  7^,  7^,, 
/  are  the  unknowns. 

Equation  (43)  can  be  put  in  matrix  form: 

C  =  \\U  -  AVf  -  \\W  -  BV\\2  (44) 


where  V  =  [rz,ry,Tz,f]‘,  U  =  [A{,  ..,A„]',  W  =  [Di,  A 

Ei  -1  0  F,  1 


Bi  0  - 1  C, 
Bn  0  -1  C„ 


and  B  = 


0  F„ 


The  minimum  for  the  criterion  of  Equation  (44)  is  attained  at  the  parameter  vector: 


V  =  (A‘A  +  B‘B)~  \A‘U  +  B'W ) 


(45) 


5.2.3  Optimal  estimation  of  the  calibration  parameters 

Once  we  have  computed  the  initial  estimate  of  V,  we  have  to  compute  a  more  accurate  estimate  of  (R.  T). 
Since  R  is  a  function  of  (a,  $,7),  we  can  transform  the  criterion  from  equation  (43)  into  the  form: 

C=£||/i-7/1(S)||2  (46) 

i=i 

where  /,  is  the  2-vector  representing  the  pixel  position  in  the  video  image,  /,  =  [r,,  q]‘,  and  5  is  the  full 
vector  of  parameters,  S  =  [7^,  Ty,  Tz,f,  a,  (3, 7]'.  We  cannot  directly  compute  C„in  since  the  functions  //, 
are  non-linear,  instead  we  linearize  C  by  using  the  first  order  approximation  of  //,  [27]: 

n 

C«^||/i-//1(S0)-7,4)S||2  (47) 

1=1 

where  J,  is  the  Jacobian  of  //,  with  respect  to  S,  So  is  the  current  estimate  of  the  parameter  vector,  and 
AS  =  S  -  So.  The  right-hand  side  of  (47)  is  minimized  when  its  derivative  with  respect  to  AS  vanishes, 
that  is: 

ft 

Y,AJ,AS  +  J‘AC,  =  0  (48) 

i=i 

where  ACi  =  /,  -  7/,(So).  Therefore,  the  best  parameter  vector  for  the  linearized  criterion  is: 

ft 

AS^-^Mr'AACi  (49) 

1=1 

Equation  (49)  is  iterated  until  there  is  no  change  in  S.  At  each  iteration,  the  estimate  So  is  updated 
by:  So  —  So  +  /IS. 
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5.2.4  Implementation  and  performance 

The  implementation  of  the  calibration  procedure  follows  the  steps  described  above.  Pairs  of  corresponding 
points  are  selected  in  a  sequence  of  video  and  range  images.  We  typically  use  twenty  pairs  of  points 
carefully  selected  at  interesting  locations  in  the  image  (e.g.  comers).  An  initial  estimate  of  the  camera 
orientation  is  (0,/3,0),  where  /3  is  physically  measured  using  an  inclinometer.  The  final  estimate  of  S  is 
usually  obtained  after  less  than  ten  iterations.  This  calibration  procedure  has  to  be  applied  only  once,  as 
long  as  the  sensors  are  not  displaced. 

Once  we  have  computed  the  calibration  parameters,  we  can  merge  range  and  video  images  into  a 
colored-range  image.  Instead  of  having  one  single  fusion  program,  we  implemented  this  as  a  library  of 
fusion  functions  that  can  be  divided  in  two  categories: 

1.  Range  — »  video:  This  set  of  functions  takes  a  pixel  or  a  set  of  pixels  (?,(?)  in  the  range  image 
and  computes  the  location  (r^ ,  (?)  in  the  video  image.  This  is  implemented  by  directly  applying 
Equations  (41)  and  (42). 

2.  Video  — *■  range:  This  set  of  functions  takes  a  pixel  or  a  set  of  pixels  (F,cc)  in  the  video  image 
and  computes  the  location  (r*,  (?)  in  the  range  image.  The  computed  location  can  be  used  in 
turn  to  compute  the  location  of  a  intensity  pixel  in  3-D  space  by  directly  applying  Equation  (3). 
The  algorithm  for  this  second  set  of  functions  is  more  involved  because  a  pixel  in  the  video  image 
corresponds  to  a  line  in  space  (Figure  40)  so  that  Equations  (41)  and  (42)  cannot  be  applied  directly. 
More  precisely,  a  pixel  (?,  cc )  corresponds,  after  transformation  by  (R.  T),  to  a  curve  C  in  the  range 
image.  C  intersects  the  image  at  locations  (?.(?),  where  the  algorithm  reports  the  location  (r6,  (?) 
that  is  the  minimum  among  all  the  range  image  pixels  that  lie  on  C  of  the  distance  between  (?,  cc ) 
and  the  projection  of  (r*,  (?)  in  the  video  image  (using  the  first  set  of  functions).  The  algorithm  is 
summarized  on  Figure  42. 

Figure  43  shows  the  colored-range  image  of  a  scene  of  stairs  and  sidewalks,  the  image  is  obtained  by 
mapping  the  intensity  values  from  the  color  image  onto  the  range  image.  Figure  44  shows  a  perspective 
view  of  the  colored-range  image.  In  this  example  [16],  we  first  compute  the  location  of  each  range  pixel 
( r ■*,  (?)  in  the  video  image,  and  then  assign  the  color  value  to  the  64  x  256  colored-range  image.  The  final 
display  is  obtained  by  rotating  the  range  pixels,  the  coordinates  of  which  are  computed  using  Equation  (3). 

5.3  Application  to  outdoor  scene  analysis 

An  example  of  the  use  of  the  fusion  of  range  and  video  images  is  outdoor  scene  analysis  [20,26]  in 
which  we  want  to  identify  the  main  components  of  an  outdoor  scene,  such  as  trees,  roads,  grass,  etc.  The 
colored-range  image  concept  makes  the  scene  analysis  problem  easier  by  providing  data  pertinent  to  both 
geometric  information  (e.g.  the  shape  of  the  trees)  and  physical  information  (e.g.  the  color  of  the  road). 

5.3.1  Feature  extraction  from  a  colored-range  image 

The  features  that  we  extract  from  a  colored-range  image  must  be  related  to  two  types  of  information:  the 
shapes  and  the  physical  properties  of  the  observed  surfaces. 
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Backprojection  of 

Projection  of  the  line  a  range  pixel  from  (C) 
in  range  image  in  color  image  space 


Figure  42:  Geometry  of  the  "video  — ►  range"  transformation 
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Figure  43:  Colored-range  image  of  stairs 
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Figure  44:  Perspective  view  of  registered  range  and  color  images 
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The  geometric  feature?  are  used  to  describe  the  shape  of  the  objects  in  the  scene.  We  propose  to  use 
two  types  of  features:  regions  that  correspond  to  smooth  patches  of  surface,  and  edges  that  correspond 
either  to  transitions  between  regions,  or  to  transitions  between  objects  (occluding  edges).  Furthermore, 
we  must  be  able  to  describe  the  features  in  a  compact  way.  One  common  approach  is  to  describe  the 
regions  as  quadric  patches,  and  the  edges  as  sets  of  tri-dimensional  line  segments.  More  sophisticated 
descriptions  are  possible  [5],  such  as  bicubic  patches  or  curvature  descriptors.  We  use  simpler  descriptors 
since  the  range  data  is  relatively  low  resolution,  and  we  do  not  have  the  type  of  accurate  geometric  model 
that  is  suited  for  using  higher  order  geometric  descriptors.  The  descriptors  attached  to  each  geometric 
feature  are: 

•  The  parameters  describing  the  shape  of  the  surface  patches.  That  is  the  parameters  of  the  quadric 
surface  that  approximate  each  surface  patch. 

•  The  shape  parameters  of  the  surface  patches  such  as  center,  area,  and  elongations. 

•  The  3-D  polygonal  description  of  the  edges. 

•  The  3-D  edge  types:  convex,  concave,  or  occluding. 

The  surface  patches  are  extracted  by  fitting  a  quadric  of  equation  X'AX  +  B'X  +  C  =  0  to  the  observed 
surfaces,  where  X  is  the  Cartesian  coordinate  vector  computed  from  a  pixel  in  the  range  image.  The 
fitting  error, 

E(A,B,C)=  Y,  i^AX.  +  B'Xi  +  C]2  (50) 

Xifz  patch 

is  used  to  control  the  growing  of  regions  over  the  observed  surfaces.  The  parameters  A ,  B,  C  are  computed 
by  minimizing  E(A,B,C)  as  in  [14], 

The  features  related  to  physical  properties  are  regions  of  homogeneous  color  in  the  video  image,  that 
is  regions  within  which  the  color  values  vary  smoothly.  The  choice  of  these  features  is  motivated  by  the 
fact  that  an  homogeneous  region  is  presumably  part  of  a  single  scene  component,  although  the  converse 
is  not  true  as  in  the  case  of  the  shadows  cast  by  an  object  on  an  homogeneous  patch  on  the  ground.  The 
color  homogeneity  criterion  we  use  is  the  distance  ( X  -  m)‘E~ 1  (X  -  m)  where  m  is  the  average  mean 
value  on  the  region,  S  is  the  covariance  matrix  of  the  color  distribution  over  the  region,  and  X  is  the 
color  value  of  the  current  pixel  in  (red,  green ,  blue )  space.  This  is  a  standard  approach  to  color  image 
segmentation  and  pattern  recognition.  The  descriptive  parameters  that  are  retained  for  each  region  are: 

•  The  color  statistics  (m,  27). 

•  The  polygonal  representation  of  the  region  border. 

•  Shape  parameters  such  as  center  or  moments. 

The  range  and  color  features  may  overlap  or  disagree.  For  example,  the  shadow  cast  by  an  object  on 
a  flat  patch  of  ground  would  divide  one  surface  patch  into  two  color  regions.  It  is  therefore  necessary 
to  have  a  cross-referencing  mechanism  between  the  two  groups  of  features.  This  mechanism  provides 
a  two-way  direct  access  to  the  geometric  features  that  intersect  color  features.  Extracting  the  relations 
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between  geometric  and  physical  features  is  straightforward  since  all  the  features  are  registered  in  the 
colored-range  image. 

An  additional  piece  of  knowledge  that  is  important  for  scene  interpretation  is  the  spatial  relationships 
between  features.  For  example,  the  fact  that  a  vertical  object  is  connected  to  a  large  flat  plane  through  a 
concave  edge  may  add  evidence  to  the  hypothesis  that  this  object  is  a  tree.  As  in  this  example,  we  use 
three  types  of  relational  data: 

•  The  list  of  features  connected  to  each  geometric  or  color  feature. 

•  The  type  of  connection  between  two  features  (convex/concave/occluding)  extracted  from  the  range 
data. 

•  The  length  and  strength  of  the  connection.  This  last  item  is  added  to  avoid  situations  in  which  two 
very  close  regions  become  accidentally  connected  along  a  small  edge. 

5.3.2  Scene  interpretation  from  the  colored-range  image 

Interpreting  a  scene  requires  the  recognition  of  the  main  components  of  the  scene  such  as  trees  or  roads. 
Since  we  are  dealing  with  natural  scenes,  we  cannot  use  the  type  of  geometric  matching  that  is  used  in 
the  context  of  industrial  parts  recognition  [5].  For  example,  we  cannot  assume  that  a  given  object  has 
specific  quadric  parameters.  Instead,  we  have  to  rely  on  "fuzzier"  evidence  such  as  the  verticality  of 
some  objects  or  the  flatness  of  others.  We  therefore  implemented  the  object  models  as  sets  of  properties 
that  translate  into  constraints  on  the  surfaces,  edges,  and  regions  found  in  the  image.  For  example,  the 
description  encodes  four  such  properties: 

•  FI:  The  color  of  the  trunk  lies  within  a  specific  range  =>  constraint  on  the  statistics  (m.  E)  of  a 
color  region. 

•  F2:  The  shape  of  the  trunk  is  roughly  cyclindrical  =$-  constraint  on  the  distribution  of  the  principal 
values  of  the  matrix  A  of  the  quadric  approximation. 

•  F3:  The  trunk  is  connected  to  a  flat  region  by  a  concave  edge  =>  constraint  on  the  neighbors  of 
the  surface,  and  the  type  of  the  connecting  edge. 

•  PA:  The  tree  has  two  parallel  vertical  occluding  edges  =>  constraint  on  the  3-D  edges  description. 

Other  objects  such  as  roads  or  grass  areas  have  similar  descriptions.  The  properties  PtJ  of  the  known 
object  models  Mj  are  evaluated  on  all  the  features  F*  extracted  from  the  colored-range  image.  The  result 
of  the  evaluation  is  a  score  S,y*  for  each  pair  (P,y,  F* ).  We  cannot  rely  on  individual  scores  since  some 
may  not  be  satisfied  because  of  other  objects,  or  because  of  segmentation  problems.  In  the  tree  trunk 
example,  one  of  the  lateral  occluding  edges  may  itself  be  occluded  by  some  other  object,  in  which  case 
the  score  for  F4  would  be  low  while  the  score  for  the  other  properties  would  still  be  high.  In  order  to 
circumvent  this  problem,  we  first  sort  the  possible  interpretations  Mj  for  a  given  feature  F*  according  to 
all  the  scores  (S,y),-.  In  doing  this,  we  ensure  that  all  the  properties  contribute  to  the  final  interpretation 
and  that  no  interpretations  are  discarded  at  this  stage  while  identifying  the  most  plausible  interpretations. 
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We  have  so  far  extracted  plausible  interpretations  only  for  individual  scene  features  Fk .  The  final 
stage  in  the  scene  interpretation  is  to  find  the  interpretations  ( Mjk,Fk )  that  are  globally  consistent.  For 
example,  property  P3  for  the  tree  implies  a  constraint  on  a  neighboring  region,  namely  that  this  has  to  be 
a  flat  ground  region.  Formally,  a  set  of  consistency  constraints  Cm,  is  associated  with  each  pair  of  objects 
( Mm,Mn ).  The  Cmi  constraints  are  propagated  through  the  individual  interpretations  (MJt ,  Fk)  by  using 
the  connectivity  information  stored  in  the  colored-range  feature  description.  The  propagation  is  simple 
considering  the  small  number  of  features  remaining  at  this  stage. 

The  final  result  is  a  consistent  set  of  interpretations  of  the  scene  features,  and  a  grouping  of  the 
features  into  sets  that  correspond  to  the  same  object.  The  last  result  is  a  by-product  of  the  consistency 
check  and  the  use  of  connectivity  data.  Figure  45  shows  the  color  and  range  images  of  a  scene  which 
contains  a  road,  a  couple  of  trees,  and  a  garbage  can.  Figure  46  shows  a  display  of  the  corresponding 
colored-range  image  in  which  the  white  pixels  are  the  points  in  the  range  image  that  have  been  mapped 
into  the  video  image.  This  set  of  points  is  actually  sparse  because  of  the  difference  in  resolutions  between 
the  two  sensors,  and  some  interpolation  was  performed  to  produce  the  dense  regions  of  Figure  46. 

Only  a  portion  of  the  image  is  registered  due  to  the  difference  in  field  of  view  between  the  two 
sensors  (60°  for  the  camera  versus  30°  in  the  vertical  direction  for  the  range  sensor).  Figure  47  shows 
a  portion  of  the  image  in  which  the  edge  points  from  the  range  image  are  projected  on  the  color  image. 
The  edges  are  interpreted  as  the  side  edges  of  the  tree  and  the  connection  between  the  ground  and  the 
tree.  Figure  48  shows  the  final  scene  interpretation.  The  white  dots  are  the  main  edges  found  in  the  range 
image.  The  power  of  the  colored-range  image  approach  is  demonstrated  by  the  way  the  road  is  extracted. 
The  road  in  this  image  is  separated  into  many  pieces  by  strong  shadows.  Even  though  the  shadows  do  not 
satisfy  the  color  constraint  on  road  region,  they  do  perform  well  on  the  shape  criterion  (flatness),  and  on 
the  consistency  criteria  (both  with  the  other  road  regions,  and  with  the  trees).  The  shadows  are  therefore 
interpreted  as  road  regions  and  merge  with  the  other  regions  into  one  road  region.  This  type  of  reasoning 
is  in  general  difficult  to  apply  when  only  video  data  is  used  unless  one  uses  stronger  models  of  the  objects 
such  as  an  explicit  model  of  a  shadowed  road  region.  Using  the  colored-range  image  also  makes  the 
consistency  propagation  a  much  easier  task  than  in  purely  color-based  scene  interpretation  programs  [32], 

6  Conclusion 

We  have  described  techniques  for  building  and  manipulating  3-D  terrain  representations  from  range  images. 
We  have  demonstrated  these  techniques  on  real  images  of  outdoor  scenes.  Some  of  them  (Sections  3.3,  3.4, 
and  4.2)  were  integrated  in  a  large  mobile  robot  system  that  was  successfully  tested  in  the  field.  We  expect 
that  the  module  that  manipulates  and  creates  these  terrain  representations  will  become  part  of  the  standard 
core  system  of  our  outdoor  mobile  robots,  just  as  a  local  path  planner  or  a  low-level  vehicle  controller 
are  standard  modules  of  a  mobile  robot  system  independent  of  its  application.  This  work  will  begin  by 
combining  the  polygonal  terrain  representation  of  Section  3.4  with  the  path  planner  of  [38]  in  order  to 
generate  the  basic  capabilities  for  an  off-road  vehicle. 

Many  issues  still  remain  to  be  investigated.  First  of  all,  we  must  define  a  uniform  way  of  representing 
and  combining  the  uncertainties  in  the  terrain  maps.  Currently,  the  uncertainty  models  depend  heavily  on 
the  type  of  sensor  used  and  on  the  level  at  which  the  terrain  is  represented.  Furthermore,  the  displacements 
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Figure  48:  Final  scene  interpretation 


between  terrain  maps  are  known  only  up  to  a  certain  level  of  uncertainty.  This  level  of  uncertainty  must 
be  evaluated  and  updated  through  the  matching  of  maps,  whether  iconic  or  feature-based.  Regarding 
the  combination  of  the  3-D  representations  with  representations  from  other  sensors,  we  need  to  define 
an  algorithm  for  sensor  registration  that  is  general  enough  for  application  to  a  variety  of  situations.  The 
algorithms  presented  in  Section  5  are  still  very  dependent  on  the  sensors  that  we  used,  and  on  the  intended 
application.  Registration  schemes  such  as  [17]  would  enable  us  to  have  a  more  uniform  approach  to  the 
problem.  An  added  effect  of  using  such  a  registration  algorithm  is  that  we  could  explicitly  represent  errors 
caused  by  the  combination  of  the  sensors,  which  we  did  not  do  in  Section  5.  Another  issue  concerns 
our  presentation  of  the  three  levels  of  terrain  representation,  the  matching  algorithms,  and  the  sensor 
combination  algorithms  as  separate  problems.  We  should  define  a  common  perceptual  architecture  to 
integrate  these  algorithms  in  a  common  representation  that  can  be  part  of  the  core  system  of  a  mobile 
robot.  Finally,  we  have  tackled  the  terrain  representation  problems  mainly  from  a  geometrical  point  of 
view.  Except  in  Section  5,  we  did  not  attempt  to  extract  semantic  interpretations  from  the  representations. 
A  natural  extension  of  this  work  is  to  use  the  3-D  terrain  representations  to  identify  known  objects  in  the 
scene.  Another  application  along  these  lines  is  to  use  the  terrain  maps  to  identify  objects  of  interest,  such 
as  terrain  regions  for  sampling  tasks  for  a  planetary  explorer  [24].  Although  we  have  performed  some 
preliminary  experiments  in  that  respect  [19,2],  extracting  semantic  information  from  terrain  representations 
remains  a  major  research  area  for  outdoor  mobile  robots. 
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